What are the most significant risk factors associated with hospital readmission rate in hip/knee replacement patients?
Hospitals with better Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS) scores will have lower readmission rates for hip/knee replacements because higher patient satisfaction often correlates with better overall care quality and patient outcomes, including reduced complications and better post-discharge support (Edwards et al., 2015).
It is anticipated that patient satisfaction, clinical treatment quality, and operational efficiency will be the most important risk factor categories associated with hospital readmission rates among patients who have had hip or knee replacement surgery. The most important variables influencing readmission rates for patients after hip or knee replacement surgery will most likely be a combination of high HCAHPS scores and reduced rates of clinical complications.
Our preprocessing and feature engineering approach was designed to address the complexities highlighted during exploratory data analysis (EDA). The initial plan involved several key steps:
Preprocessing: We focused on handling missing values, encoding categorical variables, and reducing collinearity. The preprocessing steps included dummy encoding of ordinal data, mapping categorical states to numeric codes, and removing variables with high missingness or redundancy. This approach aimed to ensure that the dataset was clean and suitable for further analysis. We specifically addressed the high number of missing values and collinearity issues as identified in our EDA.
Feature Engineering: We engineered features to consolidate related data and reduce dimensionality. For instance, we created a composite variable for overall mortality scores and removed redundant or highly correlated variables. This strategy aimed to simplify the feature set while retaining critical information, thus enhancing model performance and interpretability.
Initial Model Selection: Based on our EDA findings, we planned to explore multiple methods of unsupervised learning and employ a random forest model. Unsupervised methods, such as KNN for segmentation analysis, were chosen to identify clusters and uncover underlying patterns in the data. The random forest model was selected as an initial supervised learning approach due to its robustness in handling diverse data types and managing feature importance.
To ensure the dataset’s suitability for analysis, several preprocessing steps were employed. First, categorical variables were identified, and appropriate encoding methods were applied. The EDV column, which contains ordinal data, was dummy encoded based on predefined levels, transforming it into a numeric format that reflects its ordinal nature. This approach preserves the inherent order of the data while making it suitable for statistical modeling. Additionally, categorical states were encoded using a numeric mapping to eliminate potential ordinal relationships and standardize the representation of states. This choice was made to facilitate model compatibility and avoid any unintended ordinal implications in the analysis.
Unnecessary or highly correlated variables were then removed to streamline the dataset. Variables with significant missingness or redundancy, such as certain patient survey ratings and summary statistics, were excluded. The decision to remove these variables was guided by their high correlation with other variables, high missingness, or redundancy with already included metrics. For instance, combining various patient survey metrics into a single comprehensive score helped reduce collinearity and simplify the feature set. See table 1 for a list of all variables that were removed and justification for doing so.
| Variable | Justification |
|---|---|
| OP_18c | Removed due to high correlation and low relevance. |
| OP_22 | Removed due to high correlation and low relevance. |
| ED_2_Strata_1 | Removed due to high percentage of missingness. |
| OP_23 | Removed due to high percentage of missingness. |
| VTE_2 | Removed due to high percentage of missingness. |
| STK_02 | Removed as stroke data is not relevant to Hip/Knee Surgery. |
| STK_05 | Removed as stroke data is not relevant to Hip/Knee Surgery. |
| STK_06 | Removed as stroke data is not relevant to Hip/Knee Surgery. |
| HcahpsLinearMeanValue_H_RECMND_LINEAR_SCORE | Removed due to strong correlation with overall hospital rating. |
| ExcessReadmissionRatio_HIP-KNEE | Removed due to high correlation with target variable. |
| ExpectedReadmissionRate_HIP-KNEE | Removed due to high correlation with target variable. |
| NumberOfReadmissions_HIP-KNEE | Removed as it is influenced by hospital size, which is not available. |
| Sepsis Variables | Removed due to unclear definition in dataset dictionary. |
| Score_PSI_90 | Removed because it is a summary of other PSI variables, making it redundant. |
| Patient Survey Data | Removed due to collinearity and redundancy with other patient survey metrics. |
Feature engineering focused on creating meaningful variables from existing data and removing irrelevant or redundant features. A notable example is the creation of the Score_Ovr_MORT variable, which aggregates multiple mortality-related scores into a single measure. This new feature consolidates related variables, reducing dimensionality while preserving critical information on mortality rates.
Make sure to briefly recap why you chose this as your initial model if you didn’t earlier (1-2 paragraphs) What methods did you use to perform your initial model? Why did you make the choices that you did? Did you do cross-validation? If yes, describe. What are your model’s assumptions? Did you test for them? If yes, describe.
First, categorical variables were appropriately encoded, and redundant variables were removed, as described in the previous section. Collinearity was greatly reduced, as evident by the heat maps in Figure 1 and Figure 2. Additionally, all rows with NA values for the target variables were removed. We also decided to remove the one facility that had an NA value, which happened to be the same observation with a missing state value. Variables with less than 5% missing data were imputed by the median, as this has a small impact on the variation of the predictors. Any variables with greater than 5% missing values were imputed using kNN to allow for more accurate imputation and mitigate reduction in variation. All steps were repeated on the test dataset, which is the most recent snapshot of the data from April 24, 2024. Finally, descriptive statistics were computed for all variables.
(1 paragraph) Give any pre-processing results (e.g., splitting into testing/training sets, handling of missing values), especially if they are needed to justify decisions. Again, graphs or summary statistics may be used to defend choices and do not count toward the page limit and may be placed in the Appendix.
| mean | sd | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|
| PredictedReadmissionRate_HIP_KNEE | 4.546561e+00 | 0.9093914 | 0.8576841 | 1.9279 | 8.569 | 6.6411 | 0.4373061 | 0.4908058 | 0.0212407 |
| HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE | 8.672450e+01 | 3.7230744 | 2.9652000 | 66.0000 | 98.000 | 32.0000 | -0.5851721 | 1.5050011 | 0.0869602 |
| EDV | 2.694490e+00 | 1.0250451 | 1.4826000 | 1.0000 | 4.000 | 3.0000 | -0.1385316 | -1.1578643 | 0.0239421 |
| HCP_COVID_19 | 8.887098e+01 | 9.5606361 | 8.3025600 | 32.6000 | 100.000 | 67.4000 | -1.3148000 | 2.1219028 | 0.2233087 |
| IMM_3 | 7.932570e+01 | 17.7478481 | 16.3086000 | 4.0000 | 100.000 | 96.0000 | -1.0958217 | 0.7241943 | 0.4145381 |
| OP_18b | 1.895739e+02 | 49.6783664 | 44.4780000 | 62.0000 | 587.000 | 525.0000 | 0.9817063 | 3.3080181 | 1.1603422 |
| OP_29 | 9.215876e+01 | 11.7440127 | 4.4478000 | 0.0000 | 100.000 | 100.0000 | -3.3057142 | 14.4641144 | 0.2743060 |
| SAFE_USE_OF_OPIOIDS | 1.557229e+01 | 4.2502679 | 2.9652000 | 0.0000 | 45.000 | 45.0000 | 0.5890847 | 3.1474800 | 0.0992739 |
| VTE_1 | 9.107365e+01 | 9.1561257 | 4.4478000 | 5.0000 | 100.000 | 95.0000 | -4.3773341 | 29.2741519 | 0.2138605 |
| Score_COMP_HIP_KNEE | 3.179924e+00 | 0.5539759 | 0.4447800 | 1.6000 | 6.200 | 4.6000 | 0.7280978 | 1.7214783 | 0.0129393 |
| Score_PSI_03 | 5.878123e-01 | 0.5183737 | 0.2816940 | 0.0500 | 6.310 | 6.2600 | 3.5027779 | 21.8351982 | 0.0121077 |
| Score_PSI_04 | 1.688142e+02 | 19.0239898 | 15.6117780 | 86.6800 | 241.810 | 155.1300 | -0.0656641 | 1.2542628 | 0.4443451 |
| Score_PSI_06 | 2.467703e-01 | 0.0452557 | 0.0296520 | 0.1200 | 0.510 | 0.3900 | 0.9085914 | 1.9551905 | 0.0010570 |
| Score_PSI_08 | 8.994540e-02 | 0.0082004 | 0.0000000 | 0.0600 | 0.130 | 0.0700 | 0.5502254 | 1.7580966 | 0.0001915 |
| Score_PSI_09 | 2.514146e+00 | 0.4953103 | 0.3409980 | 1.1000 | 6.100 | 5.0000 | 1.2648156 | 4.8211377 | 0.0115690 |
| Score_PSI_10 | 1.572951e+00 | 0.3731798 | 0.1482600 | 0.4700 | 4.550 | 4.0800 | 1.8287437 | 7.4056616 | 0.0087164 |
| Score_PSI_11 | 8.969864e+00 | 3.1123203 | 2.2239000 | 2.7300 | 49.000 | 46.2700 | 2.7479837 | 22.3393252 | 0.0726947 |
| Score_PSI_12 | 3.582755e+00 | 0.7944605 | 0.6671700 | 1.6100 | 7.510 | 5.9000 | 0.9512312 | 1.6446910 | 0.0185563 |
| Score_PSI_13 | 5.287103e+00 | 1.0312666 | 0.7413000 | 2.1700 | 10.790 | 8.6200 | 1.0438233 | 2.7498825 | 0.0240874 |
| Score_PSI_14 | 2.010540e+00 | 0.3569335 | 0.1927380 | 1.0700 | 4.400 | 3.3300 | 1.9129489 | 6.4857308 | 0.0083369 |
| Score_PSI_15 | 1.102668e+00 | 0.3271882 | 0.2223900 | 0.3500 | 3.430 | 3.0800 | 1.7127063 | 5.4101230 | 0.0076422 |
| FacilityName* | 9.026263e+02 | 517.8533090 | 667.1700000 | 1.0000 | 1796.000 | 1795.0000 | -0.0093756 | -1.2029068 | 12.0955473 |
| State* | 2.469449e+01 | 14.4448000 | 20.7564000 | 1.0000 | 50.000 | 49.0000 | 0.0222112 | -1.3657016 | 0.3373885 |
| Payment_PAYM_90_HIP_KNEE | 2.105666e+04 | 1943.7230953 | 1716.8508000 | 15936.0000 | 34916.000 | 18980.0000 | 0.7645293 | 1.9173982 | 45.3997188 |
| StateCode* | 2.472995e+01 | 14.5368326 | 20.7564000 | 1.0000 | 50.000 | 49.0000 | 0.0160810 | -1.3497989 | 0.3395381 |
| Score_Ovr_MORT | 1.305795e+01 | 1.1615024 | 1.0378200 | 8.1600 | 17.420 | 9.2600 | -0.0667113 | 0.5511229 | 0.0271293 |
NOTE: Make sure you thoroughly describe your unsupervised ML methods. Note: If inappropriate to your data, justification for exclusion may be provided instead.
Give your analysis of your initial model’s (1) algorithm, (2) assumptions, and (3) overfitting for your initial model, using tables, graphs, and statistics to defend choices and interpretations, and do not count toward the page limit. Algorithm: appropriate tables or figures are given to interpret the preliminary model results using evaluation metrics Assumptions: include any tests (e.g., Durbin-Watson), graphs (e.g., residuals plots), heuristics, or other diagnostics as needed Overfitting: control for possible overfitting, which may include but is not limited to: (1) validation with the training and testing sets, (2) cross-validation, (3) regularization or penalties, and/or (4) additional feature selection This will be performed in greater detail next week, so you may just give a cursory treatment to this to save you time this week. (~ 1 paragraph)
(1-2 paragraphs) Include a summary of your key takeaways from both the Pre-processing & feature selection, as well as the initial model you ran Make sure to refer back to the question you’ve proposed Refer back to what you initially proposed and revise if needed! Include justifications for changes, including referring back to sections, tables, or graphs in your text. (1-2 paragraphs) Includes a plan for next steps: Model Selection & Tuning Plan What additional models will you run? How will you validate them? What hyperparameter tuning is needed? Did this analysis change anything in your Analysis Plan?
| Measure_ID | Description |
|---|---|
| FacilityId* | Unique facility identifier. |
| FacilityName | Name of the facility. |
| State | State where the facility is located. |
| ExcessReadmissionRatio_HIP-KNEE | The ratio of the predicted readmission rate to the expected readmission rate, based on an average hospital with similar patients. Performance is compared against a ratio of one, such that below one is better and above one is worse in terms of readmission rates. |
| PredictedReadmissionRate_HIP_KNEE* | The number of readmissions within 30 days predicted based on the hospital’s performance with its observed case mix. The predicted number of readmissions is estimated using a hospital-specific intercept, and is intended to reflect the annual expected performance of the hospital given its historical case and patient mix and performance. |
| ExpectedReadmissionRate_HIP-KNEE | The expected number of readmissions in each hospital is estimated using its patient mix and an average hospital-specific intercept. It is thus indirectly standardized to other hospitals with similar case and patient mixes. |
| NumberOfReadmissions_HIP-KNEE | Crude number of readmissions in each hospital within 30 days. |
| H_HSP_RATING_LINEAR_SCORE* | Overall hospital rating - linear mean score. Employs all survey response items in each HCAHPS measure and are converted and combined into a 0-100 linear-scaled measure score. |
| H_RECMND_LINEAR_SCORE | Recommend hospital - linear mean score. From question: Would you recommend this hospital to your friends and family? |
| EDV* | Emergency department volume. Number based on the volume of patients submitted by a hospital used for the measure OP-22: Left without Being Seen. |
| ED_2 | Average (median) admit decision time to time of departure from the emergency department for emergency department patients admitted to inpatient status. |
| IMM_3* | Healthcare workers given influenza vaccination. |
| HCP_COVID_19* | COVID-19 vaccination coverage among healthcare providers. |
| OP_18b* | Average (median) time patients spent in the emergency department before leaving from the visit. |
| OP_18c | Average time patients spent in the emergency department before being sent home (Median Time from ED Arrival to ED Departure for Discharged ED Patients – Psychiatric/Mental Health Patients). |
| OP_22 | Percentage of patients who left the emergency department before being seen. |
| OP_23 | Percentage of patients who came to the emergency department with stroke symptoms who received brain scan results within 45 minutes of arrival. |
| OP_29* | Percentage of patients receiving appropriate recommendation for follow-up screening colonoscopy. |
| SAFE_USE_OF_OPIOIDS* | Percentage of patients who were prescribed 2 or more opioids or an opioid and benzodiazepine concurrently at discharge. |
| SEP_1 | Severe sepsis and septic shock. |
| SEP_SH_3HR | Septic shock 3 hour. |
| SEP_SH_6HR | Septic shock 6 hour. |
| SEV_SEP_3HR | Severe sepsis 3 hour. |
| SEV_SEP_6HR | Severe sepsis 6 hour. |
| STK_02 | Percentage of ischemic stroke patients prescribed or continuing to take antithrombotic therapy at hospital discharge. |
| STK_05 | Percentage of ischemic stroke patients administered antithrombotic therapy by the end of hospital day 2. |
| STK_06 | Percentage of ischemic stroke patients who are prescribed or continuing to take statin medication at hospital discharge. |
| VTE_1* | Percentage of patients that received VTE prophylaxis after hospital admission or surgery. |
| VTE_2 | Percentage of patients that received VTE prophylaxis after being admitted to the intensive care unit (ICU). |
| Score_COMP_HIP_KNEE* | Rate of complications for hip/knee replacement patients. |
| Score_MORT_30_AMI | Death rate for heart attack patients. |
| Score_MORT_30_COPD | Death rate for chronic obstructive pulmonary disease (COPD) patients. |
| Score_MORT_30_HF | Death rate for heart failure patients. |
| Score_MORT_30_PN | Death rate for pneumonia patients. |
| Score_MORT_30_STK | Death rate for stroke patients. |
| Score_Ovr_MORT* | Summary measure (row-wise mean) of Score_MORT_30_AMI, Score_MORT_30_COPD, Score_MORT_30_HF, Score_MORT_30_PN, and Score_MORT_30_STK. |
| Score_PSI_03* | Rate of pressure sores. |
| Score_PSI_04* | Deaths among patients with serious treatable complications after surgery. |
| Score_PSI_06* | Collapsed lung due to medical treatment. |
| Score_PSI_08* | Broken hip from a fall after surgery. |
| Score_PSI_09* | Postoperative hemorrhage or hematoma rate. |
| Score_PSI_10* | Kidney and diabetic complications after surgery. |
| Score_PSI_11* | Respiratory failure after surgery. |
| Score_PSI_12* | Serious blood clots after surgery. |
| Score_PSI_13* | Blood stream infection after surgery. |
| Score_PSI_14* | A wound that splits open after surgery on the abdomen or pelvis. |
| Score_PSI_15* | Accidental cuts and tears from medical treatment. |
| Score_PSI_90 | Serious complications (this is a composite or summary measure). |
| Payment_PAYM_90_HIP_KNEE* | Payment for hip/knee replacement - estimates of payments associated with a 90-day episode of care for hip/knee replacement. |
# Create function to find categorical variables
is_categorical <- function(x) is.factor(x) | is.character(x)
# Apply function to all variables in the dataset
categorical_vars <- sapply(HipKneeClean, is_categorical)
# Print the names of all categorical variables
categorical <- names(HipKneeClean)[categorical_vars]
categorical
## [1] "FacilityId" "EDV" "FacilityName" "State"
# Define the encoding mapping (ignore NAs for now)
encoding_map <- c(
'low' = 1,
'medium' = 2,
'high' = 3,
'very high' = 4
)
# Dummy encoding used due to ordinal nature of this data
# Create a copy of HipKneeClean and name it HipKneeTrain to separate cleaned dataset and the training dataset
HipKneeTrain <- HipKneeClean %>%
mutate(EDV = recode(EDV, !!!encoding_map))
# Print first 20 rows of EDV column in HipKneeClean and HipKneeTrain to ensure proper encoding
cat("HipKneeClean")
## HipKneeClean
print(head(HipKneeClean$EDV, 20))
## [1] "high" "high" "high" "low" "low" "high"
## [7] "low" "medium" "low" "medium" "low" "low"
## [13] "high" "high" "very high" "very high" "low" "high"
## [19] "low" "very high"
cat("HipKneeTrain")
## HipKneeTrain
print(head(HipKneeTrain$EDV, 20))
## [1] 3 3 3 1 1 3 1 2 1 2 1 1 3 3 4 4 1 3 1 4
# Manually map out each state with their respective code in alphabetical order with a preceding 0 to make data non-ordinal
state_mapping <- c(
"AL" = "001",
"AK" = "002",
"AZ" = "003",
"AR" = "004",
"CA" = "005",
"CO" = "006",
"CT" = "007",
"DE" = "008",
"FL" = "009",
"GA" = "010",
"HI" = "011",
"ID" = "012",
"IL" = "013",
"IN" = "014",
"IA" = "015",
"KS" = "016",
"KY" = "017",
"LA" = "018",
"ME" = "019",
"MD" = "020",
"MA" = "021",
"MI" = "022",
"MN" = "023",
"MS" = "024",
"MO" = "025",
"MT" = "026",
"NE" = "027",
"NV" = "028",
"NH" = "029",
"NJ" = "030",
"NM" = "031",
"NY" = "032",
"NC" = "033",
"ND" = "034",
"OH" = "035",
"OK" = "036",
"OR" = "037",
"PA" = "038",
"RI" = "039",
"SC" = "040",
"SD" = "041",
"TN" = "042",
"TX" = "043",
"UT" = "044",
"VT" = "045",
"VA" = "046",
"WA" = "047",
"WV" = "048",
"WI" = "049",
"WY" = "050"
)
# Create new "StateCode" column with the encoded values
HipKneeTrain <- HipKneeTrain %>%
mutate(StateCode = state_mapping[State])
# Print 100 rows of the "State" and "StateCode" columns to ensure accuracy
print("State and StateCode Columns")
## [1] "State and StateCode Columns"
print(head(HipKneeTrain[c("State", "StateCode")], 100))
## # A tibble: 100 × 2
## State StateCode
## <chr> <chr>
## 1 AL 001
## 2 AL 001
## 3 AL 001
## 4 AL 001
## 5 AL 001
## 6 AL 001
## 7 AL 001
## 8 AL 001
## 9 AL 001
## 10 AL 001
## # ℹ 90 more rows
# Print all unique values in "StateCode" column to ensure accuracy
print("Unique StateCode Values")
## [1] "Unique StateCode Values"
print(unique(HipKneeTrain$StateCode))
## [1] "001" "002" "003" "004" "005" "006" "007" "008" NA "009" "010" "011"
## [13] "012" "013" "014" "015" "016" "017" "018" "019" "020" "021" "022" "023"
## [25] "024" "025" "026" "027" "028" "029" "030" "031" "032" "033" "034" "035"
## [37] "036" "037" "038" "039" "040" "041" "042" "043" "044" "045" "046" "047"
## [49] "048" "049" "050"
# Compute correlation matrix
cor_matrix <- cor(HipKneeTrain %>% select_if(is.numeric), use = "pairwise.complete.obs")
# Melt the correlation matrix into a long format
cor_melted <- melt(cor_matrix)
# Plot heatmap
ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Figure 5. Correlation Heatmap of Numeric Variables")
# Convert correlation matrix to df
cor_table <- as.data.frame(cor_matrix)
# Add variable names as a column
cor_table$Variable <- rownames(cor_table)
# Reorder columns
cor_table <- cor_table %>%
select(Variable, everything())
# Print table
cor_table %>%
kable(caption = "Table 8. Correlation Coefficients Table") %>%
kable_styling(bootstrap_options = c("hover", "striped", "responsive"))
| Variable | PredictedReadmissionRate_HIP-KNEE | HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE | EDV | HCP_COVID_19 | IMM_3 | OP_18b | OP_29 | SAFE_USE_OF_OPIOIDS | VTE_1 | Score_COMP_HIP_KNEE | Score_MORT_30_AMI | Score_MORT_30_COPD | Score_MORT_30_HF | Score_MORT_30_PN | Score_MORT_30_STK | Score_PSI_03 | Score_PSI_04 | Score_PSI_06 | Score_PSI_08 | Score_PSI_09 | Score_PSI_10 | Score_PSI_11 | Score_PSI_12 | Score_PSI_13 | Score_PSI_14 | Score_PSI_15 | Payment_PAYM_90_HIP_KNEE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PredictedReadmissionRate_HIP-KNEE | PredictedReadmissionRate_HIP-KNEE | 1.0000000 | -0.2060912 | 0.1986939 | -0.0563082 | -0.0028840 | 0.1295727 | -0.0106510 | 0.1063002 | 0.0654668 | 0.3208550 | 0.0074065 | -0.0794948 | -0.1067828 | -0.0985660 | -0.0376746 | -0.0037334 | -0.0449077 | 0.0154891 | -0.0214412 | -0.0182303 | 0.0710046 | 0.1130121 | 0.1047402 | 0.1193336 | 0.0140012 | -0.0158282 | 0.2975679 |
| HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE | HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE | -0.2060912 | 1.0000000 | -0.2262341 | 0.0302154 | 0.2182505 | -0.2448842 | 0.0865028 | 0.1100002 | -0.0248375 | -0.1091775 | -0.0952111 | -0.0230632 | 0.0300940 | -0.0702915 | -0.0760295 | -0.0499063 | -0.0948401 | -0.0054742 | -0.0590998 | 0.0707774 | -0.0157615 | -0.1615833 | -0.0674588 | -0.1458284 | -0.0172231 | 0.0304634 | -0.2108956 |
| EDV | EDV | 0.1986939 | -0.2262341 | 1.0000000 | 0.1599806 | 0.0038674 | 0.5918897 | 0.0603877 | -0.1223889 | 0.2992859 | -0.0240093 | -0.0687401 | -0.0840621 | -0.2588739 | -0.1904351 | -0.0754281 | 0.0292819 | -0.0438108 | -0.0308989 | -0.1637017 | -0.0181523 | 0.0837980 | 0.0455190 | 0.0749218 | 0.0724936 | 0.0399935 | 0.0144400 | 0.0053946 |
| HCP_COVID_19 | HCP_COVID_19 | -0.0563082 | 0.0302154 | 0.1599806 | 1.0000000 | 0.3203622 | 0.2574291 | 0.1067941 | -0.0812735 | 0.0241622 | -0.0510683 | -0.0869890 | -0.1128278 | -0.1245435 | -0.1523779 | -0.0988833 | 0.0943953 | 0.0417007 | 0.0225916 | -0.0272232 | 0.0549990 | 0.0009222 | -0.0909811 | 0.1091949 | -0.0160408 | 0.0169512 | 0.0430812 | -0.0627505 |
| IMM_3 | IMM_3 | -0.0028840 | 0.2182505 | 0.0038674 | 0.3203622 | 1.0000000 | 0.1105628 | 0.1317922 | 0.0410289 | 0.0906329 | -0.0212916 | -0.0165321 | -0.0616051 | -0.0010634 | -0.0761104 | 0.0146397 | 0.0451508 | 0.0601831 | 0.0544576 | -0.0311579 | 0.0899361 | 0.0412713 | -0.0625676 | 0.0594933 | -0.0250714 | 0.0723872 | 0.0625753 | -0.0720431 |
| OP_18b | OP_18b | 0.1295727 | -0.2448842 | 0.5918897 | 0.2574291 | 0.1105628 | 1.0000000 | 0.0506067 | -0.1400845 | 0.2344307 | -0.0293698 | -0.0678837 | -0.1527290 | -0.2195933 | -0.1858291 | -0.0905644 | 0.0583644 | 0.0638412 | 0.0187794 | -0.0806516 | 0.0224833 | 0.0544528 | -0.0076797 | 0.1653812 | 0.0993570 | 0.0676972 | 0.0374530 | -0.0241965 |
| OP_29 | OP_29 | -0.0106510 | 0.0865028 | 0.0603877 | 0.1067941 | 0.1317922 | 0.0506067 | 1.0000000 | -0.0650231 | 0.1567526 | -0.0096464 | -0.0600569 | 0.0081252 | 0.0099705 | -0.0536184 | -0.0251654 | -0.0032584 | 0.0312780 | 0.0059688 | -0.0199006 | 0.0150699 | 0.0331569 | -0.0837910 | -0.0108546 | -0.0087678 | 0.0163779 | 0.0419068 | -0.0815209 |
| SAFE_USE_OF_OPIOIDS | SAFE_USE_OF_OPIOIDS | 0.1063002 | 0.1100002 | -0.1223889 | -0.0812735 | 0.0410289 | -0.1400845 | -0.0650231 | 1.0000000 | -0.0563373 | -0.0081923 | -0.0643353 | -0.0362573 | 0.0171605 | -0.0204107 | -0.0804707 | -0.0287158 | -0.0995591 | -0.0603629 | -0.0017558 | -0.0043396 | -0.0382369 | 0.0137283 | -0.0380145 | -0.0258097 | -0.0166146 | -0.0268805 | -0.0048449 |
| VTE_1 | VTE_1 | 0.0654668 | -0.0248375 | 0.2992859 | 0.0241622 | 0.0906329 | 0.2344307 | 0.1567526 | -0.0563373 | 1.0000000 | -0.0526925 | -0.0493931 | -0.0282911 | -0.1051171 | -0.1710235 | -0.1238916 | -0.0378500 | -0.1180160 | -0.0262166 | -0.0522876 | -0.0577213 | -0.0037803 | -0.0301775 | -0.0317534 | -0.0440578 | 0.0086141 | 0.0326658 | -0.1256363 |
| Score_COMP_HIP_KNEE | Score_COMP_HIP_KNEE | 0.3208550 | -0.1091775 | -0.0240093 | -0.0510683 | -0.0212916 | -0.0293698 | -0.0096464 | -0.0081923 | -0.0526925 | 1.0000000 | 0.0830479 | -0.0203930 | -0.0007242 | 0.0241066 | 0.0211621 | 0.0498557 | 0.0038509 | 0.0505415 | 0.0577776 | 0.0540124 | 0.0813038 | 0.1279724 | 0.1458258 | 0.1334619 | 0.0498603 | 0.0433809 | 0.3410864 |
| Score_MORT_30_AMI | Score_MORT_30_AMI | 0.0074065 | -0.0952111 | -0.0687401 | -0.0869890 | -0.0165321 | -0.0678837 | -0.0600569 | -0.0643353 | -0.0493931 | 0.0830479 | 1.0000000 | 0.2498600 | 0.3407616 | 0.3309425 | 0.2222539 | 0.0415523 | 0.2105379 | 0.0885083 | 0.1010348 | 0.0889343 | 0.1066619 | 0.1037006 | 0.0492328 | 0.0467554 | 0.0454462 | 0.0297688 | 0.0591548 |
| Score_MORT_30_COPD | Score_MORT_30_COPD | -0.0794948 | -0.0230632 | -0.0840621 | -0.1128278 | -0.0616051 | -0.1527290 | 0.0081252 | -0.0362573 | -0.0282911 | -0.0203930 | 0.2498600 | 1.0000000 | 0.3844105 | 0.3710744 | 0.2038243 | -0.0069743 | 0.1713379 | 0.0478268 | 0.0397571 | 0.0429090 | 0.0320669 | 0.0426574 | -0.0532586 | 0.0026944 | 0.0734846 | 0.0340007 | -0.0406696 |
| Score_MORT_30_HF | Score_MORT_30_HF | -0.1067828 | 0.0300940 | -0.2588739 | -0.1245435 | -0.0010634 | -0.2195933 | 0.0099705 | 0.0171605 | -0.1051171 | -0.0007242 | 0.3407616 | 0.3844105 | 1.0000000 | 0.4479367 | 0.3147371 | 0.0371596 | 0.2556384 | 0.0679149 | 0.1051698 | 0.0707269 | 0.0383771 | 0.0362529 | -0.0300702 | -0.0086832 | 0.0647245 | 0.0342374 | -0.0350247 |
| Score_MORT_30_PN | Score_MORT_30_PN | -0.0985660 | -0.0702915 | -0.1904351 | -0.1523779 | -0.0761104 | -0.1858291 | -0.0536184 | -0.0204107 | -0.1710235 | 0.0241066 | 0.3309425 | 0.3710744 | 0.4479367 | 1.0000000 | 0.3042563 | 0.0303815 | 0.2301195 | 0.0543554 | 0.0884315 | 0.0217880 | 0.0237048 | 0.0704445 | 0.0089560 | 0.0393676 | 0.0464407 | 0.0029691 | -0.0062985 |
| Score_MORT_30_STK | Score_MORT_30_STK | -0.0376746 | -0.0760295 | -0.0754281 | -0.0988833 | 0.0146397 | -0.0905644 | -0.0251654 | -0.0804707 | -0.1238916 | 0.0211621 | 0.2222539 | 0.2038243 | 0.3147371 | 0.3042563 | 1.0000000 | 0.0687216 | 0.2380935 | 0.0878847 | 0.1014879 | 0.0674377 | 0.0622532 | 0.0725381 | 0.0474896 | 0.0513975 | 0.0492194 | 0.0625191 | -0.0272101 |
| Score_PSI_03 | Score_PSI_03 | -0.0037334 | -0.0499063 | 0.0292819 | 0.0943953 | 0.0451508 | 0.0583644 | -0.0032584 | -0.0287158 | -0.0378500 | 0.0498557 | 0.0415523 | -0.0069743 | 0.0371596 | 0.0303815 | 0.0687216 | 1.0000000 | 0.1353085 | 0.0601750 | 0.0636661 | 0.1407342 | 0.0386211 | 0.0114365 | 0.1186788 | 0.0298580 | 0.0596798 | 0.0999683 | 0.0086745 |
| Score_PSI_04 | Score_PSI_04 | -0.0449077 | -0.0948401 | -0.0438108 | 0.0417007 | 0.0601831 | 0.0638412 | 0.0312780 | -0.0995591 | -0.1180160 | 0.0038509 | 0.2105379 | 0.1713379 | 0.2556384 | 0.2301195 | 0.2380935 | 0.1353085 | 1.0000000 | 0.0601419 | 0.0870693 | 0.1059485 | 0.0523892 | 0.0649032 | 0.0782559 | 0.0123489 | 0.0652098 | 0.1018205 | -0.0766302 |
| Score_PSI_06 | Score_PSI_06 | 0.0154891 | -0.0054742 | -0.0308989 | 0.0225916 | 0.0544576 | 0.0187794 | 0.0059688 | -0.0603629 | -0.0262166 | 0.0505415 | 0.0885083 | 0.0478268 | 0.0679149 | 0.0543554 | 0.0878847 | 0.0601750 | 0.0601419 | 1.0000000 | 0.0724291 | 0.1014588 | 0.0516246 | 0.0351464 | 0.1431056 | 0.0509831 | 0.0527115 | 0.0910520 | 0.0456525 |
| Score_PSI_08 | Score_PSI_08 | -0.0214412 | -0.0590998 | -0.1637017 | -0.0272232 | -0.0311579 | -0.0806516 | -0.0199006 | -0.0017558 | -0.0522876 | 0.0577776 | 0.1010348 | 0.0397571 | 0.1051698 | 0.0884315 | 0.1014879 | 0.0636661 | 0.0870693 | 0.0724291 | 1.0000000 | 0.0052449 | -0.0360093 | 0.0198090 | 0.0394605 | 0.0093444 | 0.0228045 | 0.0127268 | -0.0041983 |
| Score_PSI_09 | Score_PSI_09 | -0.0182303 | 0.0707774 | -0.0181523 | 0.0549990 | 0.0899361 | 0.0224833 | 0.0150699 | -0.0043396 | -0.0577213 | 0.0540124 | 0.0889343 | 0.0429090 | 0.0707269 | 0.0217880 | 0.0674377 | 0.1407342 | 0.1059485 | 0.1014588 | 0.0052449 | 1.0000000 | 0.0885278 | 0.0680540 | 0.1732337 | 0.0519119 | 0.1207438 | 0.2197254 | -0.0237660 |
| Score_PSI_10 | Score_PSI_10 | 0.0710046 | -0.0157615 | 0.0837980 | 0.0009222 | 0.0412713 | 0.0544528 | 0.0331569 | -0.0382369 | -0.0037803 | 0.0813038 | 0.1066619 | 0.0320669 | 0.0383771 | 0.0237048 | 0.0622532 | 0.0386211 | 0.0523892 | 0.0516246 | -0.0360093 | 0.0885278 | 1.0000000 | 0.1626632 | 0.1079488 | 0.2303938 | 0.0453739 | 0.0830134 | 0.0497447 |
| Score_PSI_11 | Score_PSI_11 | 0.1130121 | -0.1615833 | 0.0455190 | -0.0909811 | -0.0625676 | -0.0076797 | -0.0837910 | 0.0137283 | -0.0301775 | 0.1279724 | 0.1037006 | 0.0426574 | 0.0362529 | 0.0704445 | 0.0725381 | 0.0114365 | 0.0649032 | 0.0351464 | 0.0198090 | 0.0680540 | 0.1626632 | 1.0000000 | 0.1172504 | 0.2506376 | -0.0093577 | 0.0464067 | 0.1441986 |
| Score_PSI_12 | Score_PSI_12 | 0.1047402 | -0.0674588 | 0.0749218 | 0.1091949 | 0.0594933 | 0.1653812 | -0.0108546 | -0.0380145 | -0.0317534 | 0.1458258 | 0.0492328 | -0.0532586 | -0.0300702 | 0.0089560 | 0.0474896 | 0.1186788 | 0.0782559 | 0.1431056 | 0.0394605 | 0.1732337 | 0.1079488 | 0.1172504 | 1.0000000 | 0.1742084 | 0.0522204 | 0.1358951 | 0.0655557 |
| Score_PSI_13 | Score_PSI_13 | 0.1193336 | -0.1458284 | 0.0724936 | -0.0160408 | -0.0250714 | 0.0993570 | -0.0087678 | -0.0258097 | -0.0440578 | 0.1334619 | 0.0467554 | 0.0026944 | -0.0086832 | 0.0393676 | 0.0513975 | 0.0298580 | 0.0123489 | 0.0509831 | 0.0093444 | 0.0519119 | 0.2303938 | 0.2506376 | 0.1742084 | 1.0000000 | 0.0056987 | 0.0878105 | 0.0949467 |
| Score_PSI_14 | Score_PSI_14 | 0.0140012 | -0.0172231 | 0.0399935 | 0.0169512 | 0.0723872 | 0.0676972 | 0.0163779 | -0.0166146 | 0.0086141 | 0.0498603 | 0.0454462 | 0.0734846 | 0.0647245 | 0.0464407 | 0.0492194 | 0.0596798 | 0.0652098 | 0.0527115 | 0.0228045 | 0.1207438 | 0.0453739 | -0.0093577 | 0.0522204 | 0.0056987 | 1.0000000 | 0.1176726 | -0.0181150 |
| Score_PSI_15 | Score_PSI_15 | -0.0158282 | 0.0304634 | 0.0144400 | 0.0430812 | 0.0625753 | 0.0374530 | 0.0419068 | -0.0268805 | 0.0326658 | 0.0433809 | 0.0297688 | 0.0340007 | 0.0342374 | 0.0029691 | 0.0625191 | 0.0999683 | 0.1018205 | 0.0910520 | 0.0127268 | 0.2197254 | 0.0830134 | 0.0464067 | 0.1358951 | 0.0878105 | 0.1176726 | 1.0000000 | -0.0467071 |
| Payment_PAYM_90_HIP_KNEE | Payment_PAYM_90_HIP_KNEE | 0.2975679 | -0.2108956 | 0.0053946 | -0.0627505 | -0.0720431 | -0.0241965 | -0.0815209 | -0.0048449 | -0.1256363 | 0.3410864 | 0.0591548 | -0.0406696 | -0.0350247 | -0.0062985 | -0.0272101 | 0.0086745 | -0.0766302 | 0.0456525 | -0.0041983 | -0.0237660 | 0.0497447 | 0.1441986 | 0.0655557 | 0.0949467 | -0.0181150 | -0.0467071 | 1.0000000 |
# Change - to _ in HIP-KNEE
colnames(HipKneeTrain) <- gsub("-", "_", colnames(HipKneeTrain))
# Remove all NA values in target variable "PredictedReadmissionRate_HIP_KNEE"
HipKneeTrain <- HipKneeTrain %>% filter(!is.na(PredictedReadmissionRate_HIP_KNEE))
# Remove all NA values in the "State", "StateCode", and "FacilityName" columns
HipKneeTrain <- HipKneeTrain %>% drop_na(State, StateCode, FacilityName)
# Print number of remaining variables and observations
dimensions <- dim(HipKneeTrain)
cat("Number of variables:", dimensions[2], "\n")
## Number of variables: 31
cat("Number of observations:", dimensions[1], "\n")
## Number of observations: 1833
We decided to remove the one facility that had an NA value, which also happened to be the same observation with a missing state value.
# Calculate missing values
missing_values_summary <- HipKneeTrain %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
mutate(Missing_Percentage = (Missing_Count / nrow(HipKneeTrain)) * 100)
# Print table
missing_values_summary %>%
kable(caption = "Table 7. Missing Values Summary") %>%
kable_styling(bootstrap_options = c("hover", "striped", "responsive"))
| Variable | Missing_Count | Missing_Percentage |
|---|---|---|
| FacilityId | 0 | 0.0000000 |
| PredictedReadmissionRate_HIP_KNEE | 0 | 0.0000000 |
| HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE | 33 | 1.8003273 |
| EDV | 90 | 4.9099836 |
| HCP_COVID_19 | 16 | 0.8728860 |
| IMM_3 | 16 | 0.8728860 |
| OP_18b | 75 | 4.0916530 |
| OP_29 | 222 | 12.1112930 |
| SAFE_USE_OF_OPIOIDS | 69 | 3.7643208 |
| VTE_1 | 994 | 54.2280415 |
| Score_COMP_HIP_KNEE | 40 | 2.1822149 |
| Score_MORT_30_AMI | 405 | 22.0949264 |
| Score_MORT_30_COPD | 247 | 13.4751773 |
| Score_MORT_30_HF | 141 | 7.6923077 |
| Score_MORT_30_PN | 125 | 6.8194217 |
| Score_MORT_30_STK | 284 | 15.4937261 |
| Score_PSI_03 | 8 | 0.4364430 |
| Score_PSI_04 | 575 | 31.3693399 |
| Score_PSI_06 | 2 | 0.1091107 |
| Score_PSI_08 | 2 | 0.1091107 |
| Score_PSI_09 | 2 | 0.1091107 |
| Score_PSI_10 | 41 | 2.2367703 |
| Score_PSI_11 | 40 | 2.1822149 |
| Score_PSI_12 | 2 | 0.1091107 |
| Score_PSI_13 | 42 | 2.2913257 |
| Score_PSI_14 | 87 | 4.7463175 |
| Score_PSI_15 | 29 | 1.5821058 |
| FacilityName | 0 | 0.0000000 |
| State | 0 | 0.0000000 |
| Payment_PAYM_90_HIP_KNEE | 42 | 2.2913257 |
| StateCode | 0 | 0.0000000 |
# Calculate median for columns with <5% missing values
numeric_vars_low_missing <- c("HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE", "EDV", "HCP_COVID_19", "IMM_3", "OP_18b", "SAFE_USE_OF_OPIOIDS", "Score_COMP_HIP_KNEE", "Score_PSI_03", "Score_PSI_06", "Score_PSI_08", "Score_PSI_09", "Score_PSI_10", "Score_PSI_11", "Score_PSI_12", "Score_PSI_13", "Score_PSI_14", "Score_PSI_15", "Payment_PAYM_90_HIP_KNEE")
for (var in numeric_vars_low_missing) {
HipKneeTrain[[var]][is.na(HipKneeTrain[[var]])] <- median(HipKneeTrain[[var]], na.rm = TRUE)
}
# Select high missingness variables for KNN imputation
vars_for_knn <- c("VTE_1", "Score_MORT_30_AMI", "Score_MORT_30_COPD", "Score_MORT_30_HF", "Score_MORT_30_PN", "Score_MORT_30_STK", "Score_PSI_04", "OP_29")
# Perform KNN imputation
HipKneeTrain_knn <- kNN(HipKneeTrain, variable = vars_for_knn, k = 5)
# Remove columns created by the KNN function
HipKneeTrain_knn <- HipKneeTrain_knn %>% select(-ends_with("_imp"))
# Update HipKneeTrain with imputed values
HipKneeTrain[vars_for_knn] <- HipKneeTrain_knn[vars_for_knn]
# Calculate missing values
missing_values_summary <- HipKneeTrain %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
mutate(Missing_Percentage = (Missing_Count / nrow(HipKneeTrain)) * 100)
# Print table
missing_values_summary %>%
kable(caption = "Table 7. Missing Values Summary") %>%
kable_styling(bootstrap_options = c("hover", "striped", "responsive"))
| Variable | Missing_Count | Missing_Percentage |
|---|---|---|
| FacilityId | 0 | 0 |
| PredictedReadmissionRate_HIP_KNEE | 0 | 0 |
| HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE | 0 | 0 |
| EDV | 0 | 0 |
| HCP_COVID_19 | 0 | 0 |
| IMM_3 | 0 | 0 |
| OP_18b | 0 | 0 |
| OP_29 | 0 | 0 |
| SAFE_USE_OF_OPIOIDS | 0 | 0 |
| VTE_1 | 0 | 0 |
| Score_COMP_HIP_KNEE | 0 | 0 |
| Score_MORT_30_AMI | 0 | 0 |
| Score_MORT_30_COPD | 0 | 0 |
| Score_MORT_30_HF | 0 | 0 |
| Score_MORT_30_PN | 0 | 0 |
| Score_MORT_30_STK | 0 | 0 |
| Score_PSI_03 | 0 | 0 |
| Score_PSI_04 | 0 | 0 |
| Score_PSI_06 | 0 | 0 |
| Score_PSI_08 | 0 | 0 |
| Score_PSI_09 | 0 | 0 |
| Score_PSI_10 | 0 | 0 |
| Score_PSI_11 | 0 | 0 |
| Score_PSI_12 | 0 | 0 |
| Score_PSI_13 | 0 | 0 |
| Score_PSI_14 | 0 | 0 |
| Score_PSI_15 | 0 | 0 |
| FacilityName | 0 | 0 |
| State | 0 | 0 |
| Payment_PAYM_90_HIP_KNEE | 0 | 0 |
| StateCode | 0 | 0 |
# Average death rates amongst mortality variables and create new column "Score_Ovr_MORT"
HipKneeTrain$Score_Ovr_MORT <- rowMeans(HipKneeTrain[, c("Score_MORT_30_AMI",
"Score_MORT_30_COPD",
"Score_MORT_30_HF",
"Score_MORT_30_PN",
"Score_MORT_30_STK")],
na.rm = TRUE)
# Remove old mortality columns
HipKneeTrain <- HipKneeTrain[, !(names(HipKneeTrain) %in% c("Score_MORT_30_AMI",
"Score_MORT_30_COPD",
"Score_MORT_30_HF",
"Score_MORT_30_PN",
"Score_MORT_30_STK"))]
# Compute correlation matrix
cor_matrix <- cor(HipKneeTrain %>% select_if(is.numeric), use = "pairwise.complete.obs")
# Melt the correlation matrix into a long format
cor_melted <- melt(cor_matrix)
# Plot heatmap
ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Figure 5. Correlation Heatmap of Numeric Variables")
save(HipKneeTrain, file = "HipKneeTrain.RData")
We are utilizing the most recent snapshot from 04/24/2024 as our test set. Utilizing this brand new data will help to ensure that our model is generalizable and useful for future analyses.
# Set the directory for the data files
filepath <- "/Users/adelinecasali/Desktop/hospitals_04_2024/"
# List the files in the directory that have "Hospital.csv"
files <- list.files(path = filepath, pattern = "Hospital.csv")
# Iterate through each file in the list
for(f in 1:length(files)) {
# Read the CSV, clean column names to upper camel case, and store in "dat"
dat <- clean_names(read_csv(paste0(filepath, files[f]),
show_col_types = FALSE),
case = "upper_camel")
# Remove ".Hospital.csv" part of the file names to create variable name
filename <- gsub(".Hospital\\.csv", "", files[f])
# Assign data to a variable with the above created name
assign(filename, dat)
}
# Create a df of file names without ".Hospital.csv"
files <- gsub(".Hospital\\.csv", "", files) %>% data.frame()
# Set column name of the df to "File Name"
names(files) <- "File Name"
files %>%
kable(
format = "html",
caption = "Table 1. List of hospital-level data files.") %>%
kable_styling(bootstrap_options = c("striped", full_width = F)
)
| File Name |
|---|
| Complications_and_Deaths |
| FY_2024_HAC_Reduction_Program |
| FY_2024_Hospital_Readmissions_Reduction_Program |
| HCAHPS |
| Healthcare_Associated_Infections |
| Maternal_Health |
| Medicare_Hospital_Spending_Per_Patient |
| Outpatient_Imaging_Efficiency |
| Payment_and_Value_of_Care |
| Timely_and_Effective_Care |
| Unplanned_Hospital_Visits |
# Display first 10 rows of FY_2024_Hospital_Readmissions_Reduction_Program
head(FY_2024_Hospital_Readmissions_Reduction_Program,10)
## # A tibble: 10 × 12
## FacilityName FacilityId State MeasureName NumberOfDischarges Footnote
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 SOUTHEAST HEALTH ME… 010001 AL READM-30-H… N/A NA
## 2 SOUTHEAST HEALTH ME… 010001 AL READM-30-H… 616 NA
## 3 SOUTHEAST HEALTH ME… 010001 AL READM-30-A… 274 NA
## 4 SOUTHEAST HEALTH ME… 010001 AL READM-30-P… 404 NA
## 5 SOUTHEAST HEALTH ME… 010001 AL READM-30-C… 126 NA
## 6 SOUTHEAST HEALTH ME… 010001 AL READM-30-C… 117 NA
## 7 MARSHALL MEDICAL CE… 010005 AL READM-30-A… N/A 1
## 8 MARSHALL MEDICAL CE… 010005 AL READM-30-C… 137 NA
## 9 MARSHALL MEDICAL CE… 010005 AL READM-30-P… 285 NA
## 10 MARSHALL MEDICAL CE… 010005 AL READM-30-H… 129 NA
## # ℹ 6 more variables: ExcessReadmissionRatio <chr>,
## # PredictedReadmissionRate <chr>, ExpectedReadmissionRate <chr>,
## # NumberOfReadmissions <chr>, StartDate <chr>, EndDate <chr>
# Filter dataset to include numeric columns only
num_vars <- FY_2024_Hospital_Readmissions_Reduction_Program %>%
select_if(is.numeric)
# Check for missing values
miss_vals <- sapply(num_vars, function(x) sum(is.na(x)))
print(miss_vals)
## Footnote
## 12077
# Use the function "replace_with_na_all()" to replace aberrant values with NA
FY_2024_Hospital_Readmissions_Reduction_Program <- replace_with_na_all(FY_2024_Hospital_Readmissions_Reduction_Program, condition = ~ .x == "N/A")
# Replace "Too Few to Report" values with "5" in using gsub
FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions <- gsub("Too Few to Report", "5", FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions)
# Check first 10 rows to confirm that it worked
head(FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions, 10)
## [1] "5" "149" "32" "68" "11" "20" NA "14" "40" "24"
# NumberOfReadmissions had to be converted to numeric before applying integers
FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions <- as.numeric(FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions)
# Find all values of "5" in NumberOfReadmissions
fives <- which(FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions == 5)
# Replace values of "5" with random integers from 1 - 10
FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions[fives] <- sample(1:10, length(fives), replace = TRUE)
# Check the first 20 rows to see if this was applied correctly
head(FY_2024_Hospital_Readmissions_Reduction_Program$NumberOfReadmissions,20)
## [1] 9 149 32 68 11 20 NA 14 40 24 9 NA 2 21 15 83 36 75 7
## [20] NA
# Selecting the columns to convert
columns_to_convert <- c("NumberOfDischarges", "ExcessReadmissionRatio", "PredictedReadmissionRate", "ExpectedReadmissionRate", "NumberOfReadmissions")
# Use mutate_at to convert the specified columns to numeric
FY_2024_Hospital_Readmissions_Reduction_Program <- FY_2024_Hospital_Readmissions_Reduction_Program %>%
mutate_at(vars(one_of(columns_to_convert)), as.numeric)
# Print the structure of the dataframe to check the changes
str(FY_2024_Hospital_Readmissions_Reduction_Program)
## tibble [18,774 × 12] (S3: tbl_df/tbl/data.frame)
## $ FacilityName : chr [1:18774] "SOUTHEAST HEALTH MEDICAL CENTER" "SOUTHEAST HEALTH MEDICAL CENTER" "SOUTHEAST HEALTH MEDICAL CENTER" "SOUTHEAST HEALTH MEDICAL CENTER" ...
## $ FacilityId : chr [1:18774] "010001" "010001" "010001" "010001" ...
## $ State : chr [1:18774] "AL" "AL" "AL" "AL" ...
## $ MeasureName : chr [1:18774] "READM-30-HIP-KNEE-HRRP" "READM-30-HF-HRRP" "READM-30-AMI-HRRP" "READM-30-PN-HRRP" ...
## $ NumberOfDischarges : num [1:18774] NA 616 274 404 126 117 NA 137 285 129 ...
## $ Footnote : num [1:18774] NA NA NA NA NA NA 1 NA NA NA ...
## $ ExcessReadmissionRatio : num [1:18774] 0.892 1.1 0.933 0.987 0.952 ...
## $ PredictedReadmissionRate: num [1:18774] 3.53 23.13 12.9 17.05 9.81 ...
## $ ExpectedReadmissionRate : num [1:18774] 3.96 21.02 13.83 17.28 10.31 ...
## $ NumberOfReadmissions : num [1:18774] 9 149 32 68 11 20 NA 14 40 24 ...
## $ StartDate : chr [1:18774] "07/01/2019" "07/01/2019" "07/01/2019" "07/01/2019" ...
## $ EndDate : chr [1:18774] "06/30/2022" "06/30/2022" "06/30/2022" "06/30/2022" ...
FY_2024_Hospital_Readmissions_Reduction_Program <- FY_2024_Hospital_Readmissions_Reduction_Program %>%
mutate(MeasureName = gsub("READM-30-", "", MeasureName)) %>%
mutate(MeasureName = gsub("-HRRP", "", MeasureName))
readmissionsClean <- FY_2024_Hospital_Readmissions_Reduction_Program %>%
pivot_wider(
names_from = MeasureName,
values_from = c(NumberOfDischarges, ExcessReadmissionRatio, PredictedReadmissionRate, ExpectedReadmissionRate, NumberOfReadmissions),
id_cols = c(FacilityName, FacilityId, State, StartDate, EndDate)
)
# Check the new dataframe
dim(readmissionsClean)
## [1] 3129 35
head(readmissionsClean)
## # A tibble: 6 × 35
## FacilityName FacilityId State StartDate EndDate NumberOfDischarges_H…¹
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 SOUTHEAST HEALTH ME… 010001 AL 07/01/20… 06/30/… NA
## 2 MARSHALL MEDICAL CE… 010005 AL 07/01/20… 06/30/… NA
## 3 NORTH ALABAMA MEDIC… 010006 AL 07/01/20… 06/30/… NA
## 4 MIZELL MEMORIAL HOS… 010007 AL 07/01/20… 06/30/… NA
## 5 CRENSHAW COMMUNITY … 010008 AL 07/01/20… 06/30/… NA
## 6 ST. VINCENT'S EAST 010011 AL 07/01/20… 06/30/… NA
## # ℹ abbreviated name: ¹`NumberOfDischarges_HIP-KNEE`
## # ℹ 29 more variables: NumberOfDischarges_HF <dbl>,
## # NumberOfDischarges_AMI <dbl>, NumberOfDischarges_PN <dbl>,
## # NumberOfDischarges_CABG <dbl>, NumberOfDischarges_COPD <dbl>,
## # `ExcessReadmissionRatio_HIP-KNEE` <dbl>, ExcessReadmissionRatio_HF <dbl>,
## # ExcessReadmissionRatio_AMI <dbl>, ExcessReadmissionRatio_PN <dbl>,
## # ExcessReadmissionRatio_CABG <dbl>, ExcessReadmissionRatio_COPD <dbl>, …
readmissionsClean <- readmissionsClean %>%
select(FacilityName, FacilityId, State, matches("HIP-KNEE$"))
# Display first 10 rows of HCAHPS
head(HCAHPS,10)
## # A tibble: 10 × 22
## FacilityId FacilityName Address CityTown State ZipCode CountyParish
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 2 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 3 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 4 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 5 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 6 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 7 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 8 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 9 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 10 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## # ℹ 15 more variables: TelephoneNumber <chr>, HcahpsMeasureId <chr>,
## # HcahpsQuestion <chr>, HcahpsAnswerDescription <chr>,
## # PatientSurveyStarRating <chr>, PatientSurveyStarRatingFootnote <dbl>,
## # HcahpsAnswerPercent <chr>, HcahpsAnswerPercentFootnote <chr>,
## # HcahpsLinearMeanValue <chr>, NumberOfCompletedSurveys <chr>,
## # NumberOfCompletedSurveysFootnote <chr>, SurveyResponseRatePercent <chr>,
## # SurveyResponseRatePercentFootnote <chr>, StartDate <chr>, EndDate <chr>
# Filter dataset to include numeric columns only
num_vars <- HCAHPS %>%
select_if(is.numeric)
# Check for missing values
miss_vals <- sapply(num_vars, function(x) sum(is.na(x)))
print(miss_vals)
## PatientSurveyStarRatingFootnote
## 430641
# Removing all footnote columns
HCAHPS <- HCAHPS %>%
select(-ends_with("footnote"))
# Replacing all "Not Applicable" with NA
HCAHPS <- as.data.frame(sapply(HCAHPS, function(x) {
if (is.character(x)) {
x[x == "Not Applicable"] <- NA
}
return(x)
}))
# Replacing all "Not Available" with NA
HCAHPS <- as.data.frame(sapply(HCAHPS, function(x) {
if (is.character(x)) {
x[x == "Not Available"] <- NA
}
return(x)
}))
HCAHPSClean <- HCAHPS %>%
pivot_wider(
names_from = HcahpsMeasureId,
values_from = c(PatientSurveyStarRating, HcahpsAnswerPercent, HcahpsLinearMeanValue, SurveyResponseRatePercent),
id_cols = c(FacilityName, FacilityId, State)
)
# Check the new dataframe
dim(HCAHPSClean)
## [1] 4814 375
head(HCAHPSClean)
## # A tibble: 6 × 375
## FacilityName FacilityId State PatientSurveyStarRat…¹ PatientSurveyStarRat…²
## <chr> <chr> <chr> <chr> <chr>
## 1 SOUTHEAST HEAL… 010001 AL <NA> <NA>
## 2 MARSHALL MEDIC… 010005 AL <NA> <NA>
## 3 NORTH ALABAMA … 010006 AL <NA> <NA>
## 4 MIZELL MEMORIA… 010007 AL <NA> <NA>
## 5 CRENSHAW COMMU… 010008 AL <NA> <NA>
## 6 ST. VINCENT'S … 010011 AL <NA> <NA>
## # ℹ abbreviated names: ¹PatientSurveyStarRating_H_COMP_1_A_P,
## # ²PatientSurveyStarRating_H_COMP_1_SN_P
## # ℹ 370 more variables: PatientSurveyStarRating_H_COMP_1_U_P <chr>,
## # PatientSurveyStarRating_H_COMP_1_LINEAR_SCORE <chr>,
## # PatientSurveyStarRating_H_COMP_1_STAR_RATING <chr>,
## # PatientSurveyStarRating_H_NURSE_RESPECT_A_P <chr>,
## # PatientSurveyStarRating_H_NURSE_RESPECT_SN_P <chr>, …
# Display first 10 rows of Timely_and_Effective_Care
head(Timely_and_Effective_Care,10)
## # A tibble: 10 × 16
## FacilityId FacilityName Address CityTown State ZipCode CountyParish
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 2 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 3 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 4 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 5 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 6 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 7 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 8 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 9 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 10 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## # ℹ 9 more variables: TelephoneNumber <chr>, Condition <chr>, MeasureId <chr>,
## # MeasureName <chr>, Score <chr>, Sample <chr>, Footnote <chr>,
## # StartDate <chr>, EndDate <chr>
# Filter dataset to include numeric columns only
num_vars <- Timely_and_Effective_Care %>%
select_if(is.numeric)
# Check for missing values
miss_vals <- sapply(num_vars, function(x) sum(is.na(x)))
print(miss_vals)
## named list()
# Replacing all "Not Applicable" with NA
Timely_and_Effective_Care <- as.data.frame(sapply(Timely_and_Effective_Care, function(x) {
if (is.character(x)) {
x[x == "Not Applicable"] <- NA
}
return(x)
}))
# Replacing all "Not Available" with NA
Timely_and_Effective_Care <- as.data.frame(sapply(Timely_and_Effective_Care, function(x) {
if (is.character(x)) {
x[x == "Not Available"] <- NA
}
return(x)
}))
careClean <- Timely_and_Effective_Care %>%
pivot_wider(
names_from = MeasureId,
values_from = c(Score),
id_cols = c(FacilityName, FacilityId, State)
)
# Check the new dataframe
dim(careClean)
## [1] 4677 26
head(careClean)
## # A tibble: 6 × 26
## FacilityName FacilityId State EDV ED_2_Strata_1 ED_2_Strata_2 HCP_COVID_19
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 SOUTHEAST HEA… 010001 AL high <NA> <NA> 80.7
## 2 MARSHALL MEDI… 010005 AL high 148 105 79.8
## 3 NORTH ALABAMA… 010006 AL high <NA> <NA> 79
## 4 MIZELL MEMORI… 010007 AL low <NA> <NA> 57.9
## 5 CRENSHAW COMM… 010008 AL low <NA> <NA> 81.2
## 6 ST. VINCENT'S… 010011 AL high <NA> <NA> 88
## # ℹ 19 more variables: IMM_3 <chr>, OP_18b <chr>, OP_18c <chr>, OP_22 <chr>,
## # OP_23 <chr>, OP_29 <chr>, OP_31 <chr>, SAFE_USE_OF_OPIOIDS <chr>,
## # SEP_1 <chr>, SEP_SH_3HR <chr>, SEP_SH_6HR <chr>, SEV_SEP_3HR <chr>,
## # SEV_SEP_6HR <chr>, STK_02 <chr>, STK_03 <chr>, STK_05 <chr>, STK_06 <chr>,
## # VTE_1 <chr>, VTE_2 <chr>
# Display first 10 rows of Complications_and_Deaths
head(Complications_and_Deaths,10)
## # A tibble: 10 × 18
## FacilityId FacilityName Address CityTown State ZipCode CountyParish
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 2 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 3 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 4 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 5 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 6 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 7 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 8 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 9 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 10 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## # ℹ 11 more variables: TelephoneNumber <chr>, MeasureId <chr>,
## # MeasureName <chr>, ComparedToNational <chr>, Denominator <chr>,
## # Score <chr>, LowerEstimate <chr>, HigherEstimate <chr>, Footnote <chr>,
## # StartDate <chr>, EndDate <chr>
# Filter dataset to include numeric columns only
num_vars <- Complications_and_Deaths %>%
select_if(is.numeric)
# Check for missing values
miss_vals <- sapply(num_vars, function(x) sum(is.na(x)))
print(miss_vals)
## named list()
# Replacing all "Not Applicable" with NA
Complications_and_Deaths <- as.data.frame(sapply(Complications_and_Deaths, function(x) {
if (is.character(x)) {
x[x == "Not Applicable"] <- NA
}
return(x)
}))
# Replacing all "Not Available" with NA
Complications_and_Deaths <- as.data.frame(sapply(Complications_and_Deaths, function(x) {
if (is.character(x)) {
x[x == "Not Available"] <- NA
}
return(x)
}))
deathsClean <- Complications_and_Deaths %>%
pivot_wider(
names_from = MeasureId,
values_from = c(ComparedToNational, Score),
id_cols = c(FacilityName, FacilityId, State)
)
# Check the new dataframe
dim(deathsClean)
## [1] 4814 41
head(deathsClean)
## # A tibble: 6 × 41
## FacilityName FacilityId State ComparedToNational_C…¹ ComparedToNational_M…²
## <chr> <chr> <chr> <chr> <chr>
## 1 SOUTHEAST HEAL… 010001 AL No Different Than the… No Different Than the…
## 2 MARSHALL MEDIC… 010005 AL No Different Than the… No Different Than the…
## 3 NORTH ALABAMA … 010006 AL No Different Than the… Worse Than the Nation…
## 4 MIZELL MEMORIA… 010007 AL Number of Cases Too S… Number of Cases Too S…
## 5 CRENSHAW COMMU… 010008 AL <NA> Number of Cases Too S…
## 6 ST. VINCENT'S … 010011 AL No Different Than the… No Different Than the…
## # ℹ abbreviated names: ¹ComparedToNational_COMP_HIP_KNEE,
## # ²ComparedToNational_MORT_30_AMI
## # ℹ 36 more variables: ComparedToNational_MORT_30_CABG <chr>,
## # ComparedToNational_MORT_30_COPD <chr>, ComparedToNational_MORT_30_HF <chr>,
## # ComparedToNational_MORT_30_PN <chr>, ComparedToNational_MORT_30_STK <chr>,
## # ComparedToNational_PSI_03 <chr>, ComparedToNational_PSI_04 <chr>,
## # ComparedToNational_PSI_06 <chr>, ComparedToNational_PSI_08 <chr>, …
# Display first 10 rows of Payment_and_Value_of_Care
head(Payment_and_Value_of_Care,10)
## # A tibble: 10 × 22
## FacilityId FacilityName Address CityTown State ZipCode CountyParish
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 2 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 3 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 4 010001 SOUTHEAST HEALTH MEDI… 1108 R… DOTHAN AL 36301 HOUSTON
## 5 010005 MARSHALL MEDICAL CENT… 2505 U… BOAZ AL 35957 MARSHALL
## 6 010005 MARSHALL MEDICAL CENT… 2505 U… BOAZ AL 35957 MARSHALL
## 7 010005 MARSHALL MEDICAL CENT… 2505 U… BOAZ AL 35957 MARSHALL
## 8 010005 MARSHALL MEDICAL CENT… 2505 U… BOAZ AL 35957 MARSHALL
## 9 010006 NORTH ALABAMA MEDICAL… 1701 V… FLORENCE AL 35630 LAUDERDALE
## 10 010006 NORTH ALABAMA MEDICAL… 1701 V… FLORENCE AL 35630 LAUDERDALE
## # ℹ 15 more variables: TelephoneNumber <chr>, PaymentMeasureId <chr>,
## # PaymentMeasureName <chr>, PaymentCategory <chr>, Denominator <chr>,
## # Payment <chr>, LowerEstimate <chr>, HigherEstimate <chr>,
## # PaymentFootnote <dbl>, ValueOfCareDisplayId <chr>,
## # ValueOfCareDisplayName <chr>, ValueOfCareCategory <chr>,
## # ValueOfCareFootnote <dbl>, StartDate <chr>, EndDate <chr>
# Filter dataset to include numeric columns only
num_vars <- Payment_and_Value_of_Care %>%
select_if(is.numeric)
# Check for missing values
miss_vals <- sapply(num_vars, function(x) sum(is.na(x)))
print(miss_vals)
## PaymentFootnote ValueOfCareFootnote
## 9956 10044
# Replacing all "Not Applicable" with NA
Payment_and_Value_of_Care <- as.data.frame(sapply(Payment_and_Value_of_Care, function(x) {
if (is.character(x)) {
x[x == "Not Applicable"] <- NA
}
return(x)
}))
# Replacing all "Not Available" with NA
Payment_and_Value_of_Care <- as.data.frame(sapply(Payment_and_Value_of_Care, function(x) {
if (is.character(x)) {
x[x == "Not Available"] <- NA
}
return(x)
}))
paymentClean <- Payment_and_Value_of_Care %>%
pivot_wider(
names_from = PaymentMeasureId,
values_from = c(PaymentCategory, Payment),
id_cols = c(FacilityName, FacilityId, State)
)
# Check the new dataframe
dim(paymentClean)
## [1] 4645 11
head(paymentClean)
## # A tibble: 6 × 11
## FacilityName FacilityId State PaymentCategory_PAYM…¹ PaymentCategory_PAYM…²
## <chr> <chr> <chr> <chr> <chr>
## 1 SOUTHEAST HEAL… 010001 AL No Different Than the… No Different Than the…
## 2 MARSHALL MEDIC… 010005 AL No Different Than the… No Different Than the…
## 3 NORTH ALABAMA … 010006 AL Greater Than the Nati… No Different Than the…
## 4 MIZELL MEMORIA… 010007 AL Number of Cases Too S… No Different Than the…
## 5 CRENSHAW COMMU… 010008 AL Number of Cases Too S… Number of Cases Too S…
## 6 ST. VINCENT'S … 010011 AL No Different Than the… No Different Than the…
## # ℹ abbreviated names: ¹PaymentCategory_PAYM_30_AMI,
## # ²PaymentCategory_PAYM_30_HF
## # ℹ 6 more variables: PaymentCategory_PAYM_30_PN <chr>,
## # PaymentCategory_PAYM_90_HIP_KNEE <chr>, Payment_PAYM_30_AMI <chr>,
## # Payment_PAYM_30_HF <chr>, Payment_PAYM_30_PN <chr>,
## # Payment_PAYM_90_HIP_KNEE <chr>
HipKneeCleanTest <- readmissionsClean %>%
full_join(HCAHPSClean, by = "FacilityId") %>%
full_join(careClean, by = "FacilityId") %>%
full_join(deathsClean, by = "FacilityId") %>%
full_join(paymentClean, by = "FacilityId")
head(HipKneeCleanTest)
## # A tibble: 6 × 451
## FacilityName.x FacilityId State.x NumberOfDischarges_HIP-KN…¹
## <chr> <chr> <chr> <dbl>
## 1 SOUTHEAST HEALTH MEDICAL CENTER 010001 AL NA
## 2 MARSHALL MEDICAL CENTERS 010005 AL NA
## 3 NORTH ALABAMA MEDICAL CENTER 010006 AL NA
## 4 MIZELL MEMORIAL HOSPITAL 010007 AL NA
## 5 CRENSHAW COMMUNITY HOSPITAL 010008 AL NA
## 6 ST. VINCENT'S EAST 010011 AL NA
## # ℹ abbreviated name: ¹`NumberOfDischarges_HIP-KNEE`
## # ℹ 447 more variables: `ExcessReadmissionRatio_HIP-KNEE` <dbl>,
## # `PredictedReadmissionRate_HIP-KNEE` <dbl>,
## # `ExpectedReadmissionRate_HIP-KNEE` <dbl>,
## # `NumberOfReadmissions_HIP-KNEE` <dbl>, FacilityName.y <chr>, State.y <chr>,
## # PatientSurveyStarRating_H_COMP_1_A_P <chr>,
## # PatientSurveyStarRating_H_COMP_1_SN_P <chr>, …
# Removing duplicate columns
HipKneeCleanTest <- HipKneeCleanTest %>%
select(-matches("\\.(x|y|z|w|v)$"))
# Checking the dimensions
dim(HipKneeCleanTest)
# Count NA values in each column
na_counts <- sapply(HipKneeCleanTest, function(x) sum(is.na(x)))
# View the NA counts
print(na_counts)
# Calculate the percentage of NA values for each column
na_percentage <- sapply(HipKneeCleanTest, function(x) mean(is.na(x)))
# Remove columns where more than 80% of the values are NA
HipKneeCleanTest <- HipKneeCleanTest[, na_percentage <= 0.8]
# Count NA values in each column
na_counts <- sapply(HipKneeCleanTest, function(x) sum(is.na(x)))
# View the NA counts
print(na_counts)
# Check the dimensions
dim(HipKneeCleanTest)
# Remove columns containing 'AnswerPercent' or 'SurveyResponseRate'
HipKneeCleanTest <- HipKneeCleanTest %>%
select(-matches("AnswerPercent|SurveyResponseRate"))
# Check the dimensions
dim(HipKneeCleanTest)
## [1] 4816 87
# Remove columns containing 'ComparedToNational' and 'PaymentCategory'
HipKneeCleanTest <- HipKneeCleanTest %>%
select(-matches("ComparedToNational|PaymentCategory"))
# Check the dimensions
dim(HipKneeCleanTest)
## [1] 4816 67
str(HipKneeCleanTest)
## tibble [4,816 × 67] (S3: tbl_df/tbl/data.frame)
## $ FacilityId : chr [1:4816] "010001" "010005" "010006" "010007" ...
## $ ExcessReadmissionRatio_HIP-KNEE : num [1:4816] 0.892 0.798 1.247 0.992 NA ...
## $ PredictedReadmissionRate_HIP-KNEE : num [1:4816] 3.53 3.76 5.52 4.34 NA ...
## $ ExpectedReadmissionRate_HIP-KNEE : num [1:4816] 3.96 4.72 4.43 4.37 NA ...
## $ NumberOfReadmissions_HIP-KNEE : num [1:4816] 9 9 2 1 NA 9 9 10 NA 5 ...
## $ PatientSurveyStarRating_H_COMP_1_STAR_RATING : chr [1:4816] "3" "3" "2" "3" ...
## $ PatientSurveyStarRating_H_COMP_2_STAR_RATING : chr [1:4816] "4" "4" "3" "5" ...
## $ PatientSurveyStarRating_H_COMP_3_STAR_RATING : chr [1:4816] "3" "2" "2" "4" ...
## $ PatientSurveyStarRating_H_COMP_5_STAR_RATING : chr [1:4816] "3" "3" "2" "3" ...
## $ PatientSurveyStarRating_H_COMP_6_STAR_RATING : chr [1:4816] "4" "3" "3" "4" ...
## $ PatientSurveyStarRating_H_COMP_7_STAR_RATING : chr [1:4816] "4" "3" "2" "4" ...
## $ PatientSurveyStarRating_H_CLEAN_STAR_RATING : chr [1:4816] "3" "2" "1" "2" ...
## $ PatientSurveyStarRating_H_QUIET_STAR_RATING : chr [1:4816] "4" "4" "4" "4" ...
## $ PatientSurveyStarRating_H_HSP_RATING_STAR_RATING: chr [1:4816] "4" "3" "2" "4" ...
## $ PatientSurveyStarRating_H_RECMND_STAR_RATING : chr [1:4816] "4" "3" "2" "4" ...
## $ PatientSurveyStarRating_H_STAR_RATING : chr [1:4816] "4" "3" "2" "4" ...
## $ HcahpsLinearMeanValue_H_COMP_1_LINEAR_SCORE : chr [1:4816] "89" "90" "88" "91" ...
## $ HcahpsLinearMeanValue_H_COMP_2_LINEAR_SCORE : chr [1:4816] "91" "92" "89" "95" ...
## $ HcahpsLinearMeanValue_H_COMP_3_LINEAR_SCORE : chr [1:4816] "81" "75" "75" "88" ...
## $ HcahpsLinearMeanValue_H_COMP_5_LINEAR_SCORE : chr [1:4816] "77" "76" "71" "77" ...
## $ HcahpsLinearMeanValue_H_COMP_6_LINEAR_SCORE : chr [1:4816] "87" "86" "83" "87" ...
## $ HcahpsLinearMeanValue_H_COMP_7_LINEAR_SCORE : chr [1:4816] "82" "79" "77" "82" ...
## $ HcahpsLinearMeanValue_H_CLEAN_LINEAR_SCORE : chr [1:4816] "84" "80" "74" "80" ...
## $ HcahpsLinearMeanValue_H_QUIET_LINEAR_SCORE : chr [1:4816] "86" "85" "85" "87" ...
## $ HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE : chr [1:4816] "89" "85" "82" "89" ...
## $ HcahpsLinearMeanValue_H_RECMND_LINEAR_SCORE : chr [1:4816] "90" "83" "79" "88" ...
## $ EDV : chr [1:4816] "high" "high" "high" "low" ...
## $ ED_2_Strata_1 : chr [1:4816] NA "148" NA NA ...
## $ HCP_COVID_19 : chr [1:4816] "80.7" "79.8" "79" "57.9" ...
## $ IMM_3 : chr [1:4816] "95" "80" "67" "53" ...
## $ OP_18b : chr [1:4816] "215" "147" "177" "130" ...
## $ OP_18c : chr [1:4816] "317" "266" NA "216" ...
## $ OP_22 : chr [1:4816] "5" "3" "1" "4" ...
## $ OP_23 : chr [1:4816] NA NA "69" NA ...
## $ OP_29 : chr [1:4816] "47" "96" "85" "23" ...
## $ SAFE_USE_OF_OPIOIDS : chr [1:4816] "14" "19" "17" NA ...
## $ SEP_1 : chr [1:4816] "66" "74" "56" "86" ...
## $ SEP_SH_3HR : chr [1:4816] "70" "88" "77" NA ...
## $ SEP_SH_6HR : chr [1:4816] "100" "91" "81" NA ...
## $ SEV_SEP_3HR : chr [1:4816] "79" "88" "78" "89" ...
## $ SEV_SEP_6HR : chr [1:4816] "95" "96" "86" "97" ...
## $ STK_02 : chr [1:4816] "98" "100" "96" NA ...
## $ STK_05 : chr [1:4816] NA "91" NA NA ...
## $ STK_06 : chr [1:4816] NA NA "97" NA ...
## $ VTE_1 : chr [1:4816] "98" NA NA NA ...
## $ VTE_2 : chr [1:4816] "99" NA "97" NA ...
## $ Score_COMP_HIP_KNEE : chr [1:4816] "2.7" "2.3" "4.6" NA ...
## $ Score_MORT_30_AMI : chr [1:4816] "12" "13.6" "16.5" NA ...
## $ Score_MORT_30_COPD : chr [1:4816] "8.8" "9.9" "9.9" "13.7" ...
## $ Score_MORT_30_HF : chr [1:4816] "8.9" "14.9" "12.5" "12.5" ...
## $ Score_MORT_30_PN : chr [1:4816] "18" "23.3" "19.5" "28.5" ...
## $ Score_MORT_30_STK : chr [1:4816] "14.8" "15.3" "17.2" NA ...
## $ Score_PSI_03 : chr [1:4816] "0.39" "0.94" "1.39" "0.42" ...
## $ Score_PSI_04 : chr [1:4816] "184.68" "183.49" "173.63" NA ...
## $ Score_PSI_06 : chr [1:4816] "0.23" "0.22" "0.36" "0.24" ...
## $ Score_PSI_08 : chr [1:4816] "0.10" "0.09" "0.08" "0.09" ...
## $ Score_PSI_09 : chr [1:4816] "2.39" "2.69" "5.43" "2.49" ...
## $ Score_PSI_10 : chr [1:4816] "1.14" "1.37" "1.26" "1.57" ...
## $ Score_PSI_11 : chr [1:4816] "13.83" "7.19" "7.37" "8.45" ...
## $ Score_PSI_12 : chr [1:4816] "4.49" "3.01" "3.36" "3.89" ...
## $ Score_PSI_13 : chr [1:4816] "8.05" "4.46" "4.37" "5.19" ...
## $ Score_PSI_14 : chr [1:4816] "1.69" "1.87" "1.76" NA ...
## $ Score_PSI_15 : chr [1:4816] "0.93" "0.91" "1.34" "1.08" ...
## $ Score_PSI_90 : chr [1:4816] "1.21" "0.97" "1.17" "0.95" ...
## $ FacilityName : chr [1:4816] "SOUTHEAST HEALTH MEDICAL CENTER" "MARSHALL MEDICAL CENTERS" "NORTH ALABAMA MEDICAL CENTER" "MIZELL MEMORIAL HOSPITAL" ...
## $ State : chr [1:4816] "AL" "AL" "AL" "AL" ...
## $ Payment_PAYM_90_HIP_KNEE : chr [1:4816] "$22,212" "$18,030" "$21,898" NA ...
# Convert columns to numeric
HipKneeCleanTest <- HipKneeCleanTest %>%
mutate_at(vars(starts_with("PatientSurveyStarRating_"),
starts_with("HcahpsLinearMeanValue_"),
starts_with("Score_"),
starts_with("ED_"),
starts_with("IMM_"),
starts_with("OP_"),
starts_with("SEP_"),
starts_with("SEV_"),
starts_with("STK_"),
starts_with("VTE_"),
starts_with("SAFE_"),
starts_with("HCP_")),
~ as.numeric(as.character(.)))
# View the structure
str(HipKneeCleanTest)
## tibble [4,816 × 67] (S3: tbl_df/tbl/data.frame)
## $ FacilityId : chr [1:4816] "010001" "010005" "010006" "010007" ...
## $ ExcessReadmissionRatio_HIP-KNEE : num [1:4816] 0.892 0.798 1.247 0.992 NA ...
## $ PredictedReadmissionRate_HIP-KNEE : num [1:4816] 3.53 3.76 5.52 4.34 NA ...
## $ ExpectedReadmissionRate_HIP-KNEE : num [1:4816] 3.96 4.72 4.43 4.37 NA ...
## $ NumberOfReadmissions_HIP-KNEE : num [1:4816] 9 9 2 1 NA 9 9 10 NA 5 ...
## $ PatientSurveyStarRating_H_COMP_1_STAR_RATING : num [1:4816] 3 3 2 3 NA 3 3 3 NA 4 ...
## $ PatientSurveyStarRating_H_COMP_2_STAR_RATING : num [1:4816] 4 4 3 5 NA 3 4 4 NA 4 ...
## $ PatientSurveyStarRating_H_COMP_3_STAR_RATING : num [1:4816] 3 2 2 4 NA 4 3 2 NA 4 ...
## $ PatientSurveyStarRating_H_COMP_5_STAR_RATING : num [1:4816] 3 3 2 3 NA 3 3 2 NA 4 ...
## $ PatientSurveyStarRating_H_COMP_6_STAR_RATING : num [1:4816] 4 3 3 4 NA 3 3 2 NA 3 ...
## $ PatientSurveyStarRating_H_COMP_7_STAR_RATING : num [1:4816] 4 3 2 4 NA 3 3 3 NA 4 ...
## $ PatientSurveyStarRating_H_CLEAN_STAR_RATING : num [1:4816] 3 2 1 2 NA 2 2 1 NA 4 ...
## $ PatientSurveyStarRating_H_QUIET_STAR_RATING : num [1:4816] 4 4 4 4 NA 4 4 3 NA 5 ...
## $ PatientSurveyStarRating_H_HSP_RATING_STAR_RATING: num [1:4816] 4 3 2 4 NA 3 2 3 NA 4 ...
## $ PatientSurveyStarRating_H_RECMND_STAR_RATING : num [1:4816] 4 3 2 4 NA 4 2 3 NA 4 ...
## $ PatientSurveyStarRating_H_STAR_RATING : num [1:4816] 4 3 2 4 NA 3 3 3 NA 4 ...
## $ HcahpsLinearMeanValue_H_COMP_1_LINEAR_SCORE : num [1:4816] 89 90 88 91 NA 90 91 89 NA 92 ...
## $ HcahpsLinearMeanValue_H_COMP_2_LINEAR_SCORE : num [1:4816] 91 92 89 95 NA 90 91 91 NA 92 ...
## $ HcahpsLinearMeanValue_H_COMP_3_LINEAR_SCORE : num [1:4816] 81 75 75 88 NA 85 80 78 NA 85 ...
## $ HcahpsLinearMeanValue_H_COMP_5_LINEAR_SCORE : num [1:4816] 77 76 71 77 NA 76 76 72 NA 78 ...
## $ HcahpsLinearMeanValue_H_COMP_6_LINEAR_SCORE : num [1:4816] 87 86 83 87 NA 86 86 81 NA 86 ...
## $ HcahpsLinearMeanValue_H_COMP_7_LINEAR_SCORE : num [1:4816] 82 79 77 82 NA 81 79 80 NA 83 ...
## $ HcahpsLinearMeanValue_H_CLEAN_LINEAR_SCORE : num [1:4816] 84 80 74 80 NA 81 83 78 NA 88 ...
## $ HcahpsLinearMeanValue_H_QUIET_LINEAR_SCORE : num [1:4816] 86 85 85 87 NA 84 84 82 NA 89 ...
## $ HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE : num [1:4816] 89 85 82 89 NA 88 83 85 NA 90 ...
## $ HcahpsLinearMeanValue_H_RECMND_LINEAR_SCORE : num [1:4816] 90 83 79 88 NA 87 80 84 NA 91 ...
## $ EDV : chr [1:4816] "high" "high" "high" "low" ...
## $ ED_2_Strata_1 : num [1:4816] NA 148 NA NA NA NA NA NA NA NA ...
## $ HCP_COVID_19 : num [1:4816] 80.7 79.8 79 57.9 81.2 88 69.8 87.3 95.9 85.3 ...
## $ IMM_3 : num [1:4816] 95 80 67 53 45 81 65 93 98 81 ...
## $ OP_18b : num [1:4816] 215 147 177 130 118 206 160 185 102 145 ...
## $ OP_18c : num [1:4816] 317 266 NA 216 98 124 220 220 NA 324 ...
## $ OP_22 : num [1:4816] 5 3 1 4 0 5 4 3 0 2 ...
## $ OP_23 : num [1:4816] NA NA 69 NA NA 47 NA 73 NA 35 ...
## $ OP_29 : num [1:4816] 47 96 85 23 67 100 100 NA NA 82 ...
## $ SAFE_USE_OF_OPIOIDS : num [1:4816] 14 19 17 NA NA 20 14 23 NA 17 ...
## $ SEP_1 : num [1:4816] 66 74 56 86 NA 51 92 77 NA 87 ...
## $ SEP_SH_3HR : num [1:4816] 70 88 77 NA NA 78 94 83 NA 90 ...
## $ SEP_SH_6HR : num [1:4816] 100 91 81 NA NA 81 83 100 NA 94 ...
## $ SEV_SEP_3HR : num [1:4816] 79 88 78 89 NA 69 95 85 NA 94 ...
## $ SEV_SEP_6HR : num [1:4816] 95 96 86 97 NA 91 99 97 NA 99 ...
## $ STK_02 : num [1:4816] 98 100 96 NA NA 93 NA 99 NA NA ...
## $ STK_05 : num [1:4816] NA 91 NA NA NA NA NA NA NA NA ...
## $ STK_06 : num [1:4816] NA NA 97 NA NA NA NA NA NA NA ...
## $ VTE_1 : num [1:4816] 98 NA NA NA NA 79 89 84 44 59 ...
## $ VTE_2 : num [1:4816] 99 NA 97 NA NA 88 93 94 NA NA ...
## $ Score_COMP_HIP_KNEE : num [1:4816] 2.7 2.3 4.6 NA NA 3.5 3.8 3.5 NA 4.3 ...
## $ Score_MORT_30_AMI : num [1:4816] 12 13.6 16.5 NA NA 13.2 13.8 13.1 NA NA ...
## $ Score_MORT_30_COPD : num [1:4816] 8.8 9.9 9.9 13.7 NA 10.3 NA 9.2 NA 7.8 ...
## $ Score_MORT_30_HF : num [1:4816] 8.9 14.9 12.5 12.5 NA 13.5 13.6 9.9 NA 16.9 ...
## $ Score_MORT_30_PN : num [1:4816] 18 23.3 19.5 28.5 NA 20.9 22 17.2 NA 26.1 ...
## $ Score_MORT_30_STK : num [1:4816] 14.8 15.3 17.2 NA NA 12.3 NA 13.2 NA 17.3 ...
## $ Score_PSI_03 : num [1:4816] 0.39 0.94 1.39 0.42 0.54 0.13 0.41 0.63 0.57 0.47 ...
## $ Score_PSI_04 : num [1:4816] 185 183 174 NA NA ...
## $ Score_PSI_06 : num [1:4816] 0.23 0.22 0.36 0.24 0.25 0.24 0.24 0.21 0.25 0.22 ...
## $ Score_PSI_08 : num [1:4816] 0.1 0.09 0.08 0.09 0.09 0.08 0.09 0.09 0.09 0.09 ...
## $ Score_PSI_09 : num [1:4816] 2.39 2.69 5.43 2.49 NA 1.88 2.44 3.29 2.44 2.58 ...
## $ Score_PSI_10 : num [1:4816] 1.14 1.37 1.26 1.57 NA 1.72 1.51 1.2 1.57 NA ...
## $ Score_PSI_11 : num [1:4816] 13.83 7.19 7.37 8.45 NA ...
## $ Score_PSI_12 : num [1:4816] 4.49 3.01 3.36 3.89 NA 3.04 3.32 3.67 3.56 5.63 ...
## $ Score_PSI_13 : num [1:4816] 8.05 4.46 4.37 5.19 NA 5.55 4.88 6.08 5.18 NA ...
## $ Score_PSI_14 : num [1:4816] 1.69 1.87 1.76 NA NA 1.86 2.46 2.77 NA 1.83 ...
## $ Score_PSI_15 : num [1:4816] 0.93 0.91 1.34 1.08 NA 1.18 1.04 0.84 NA 0.88 ...
## $ Score_PSI_90 : num [1:4816] 1.21 0.97 1.17 0.95 NA 0.72 0.89 1.17 0.98 1.05 ...
## $ FacilityName : chr [1:4816] "SOUTHEAST HEALTH MEDICAL CENTER" "MARSHALL MEDICAL CENTERS" "NORTH ALABAMA MEDICAL CENTER" "MIZELL MEMORIAL HOSPITAL" ...
## $ State : chr [1:4816] "AL" "AL" "AL" "AL" ...
## $ Payment_PAYM_90_HIP_KNEE : chr [1:4816] "$22,212" "$18,030" "$21,898" NA ...
# Remove $ and , and convert to numeric
HipKneeCleanTest <- HipKneeCleanTest %>%
mutate_at(vars(starts_with("Payment_")),
~ as.numeric(gsub("[\\$,]", "", .)))
# Checking the structure
str(HipKneeCleanTest)
## tibble [4,816 × 67] (S3: tbl_df/tbl/data.frame)
## $ FacilityId : chr [1:4816] "010001" "010005" "010006" "010007" ...
## $ ExcessReadmissionRatio_HIP-KNEE : num [1:4816] 0.892 0.798 1.247 0.992 NA ...
## $ PredictedReadmissionRate_HIP-KNEE : num [1:4816] 3.53 3.76 5.52 4.34 NA ...
## $ ExpectedReadmissionRate_HIP-KNEE : num [1:4816] 3.96 4.72 4.43 4.37 NA ...
## $ NumberOfReadmissions_HIP-KNEE : num [1:4816] 9 9 2 1 NA 9 9 10 NA 5 ...
## $ PatientSurveyStarRating_H_COMP_1_STAR_RATING : num [1:4816] 3 3 2 3 NA 3 3 3 NA 4 ...
## $ PatientSurveyStarRating_H_COMP_2_STAR_RATING : num [1:4816] 4 4 3 5 NA 3 4 4 NA 4 ...
## $ PatientSurveyStarRating_H_COMP_3_STAR_RATING : num [1:4816] 3 2 2 4 NA 4 3 2 NA 4 ...
## $ PatientSurveyStarRating_H_COMP_5_STAR_RATING : num [1:4816] 3 3 2 3 NA 3 3 2 NA 4 ...
## $ PatientSurveyStarRating_H_COMP_6_STAR_RATING : num [1:4816] 4 3 3 4 NA 3 3 2 NA 3 ...
## $ PatientSurveyStarRating_H_COMP_7_STAR_RATING : num [1:4816] 4 3 2 4 NA 3 3 3 NA 4 ...
## $ PatientSurveyStarRating_H_CLEAN_STAR_RATING : num [1:4816] 3 2 1 2 NA 2 2 1 NA 4 ...
## $ PatientSurveyStarRating_H_QUIET_STAR_RATING : num [1:4816] 4 4 4 4 NA 4 4 3 NA 5 ...
## $ PatientSurveyStarRating_H_HSP_RATING_STAR_RATING: num [1:4816] 4 3 2 4 NA 3 2 3 NA 4 ...
## $ PatientSurveyStarRating_H_RECMND_STAR_RATING : num [1:4816] 4 3 2 4 NA 4 2 3 NA 4 ...
## $ PatientSurveyStarRating_H_STAR_RATING : num [1:4816] 4 3 2 4 NA 3 3 3 NA 4 ...
## $ HcahpsLinearMeanValue_H_COMP_1_LINEAR_SCORE : num [1:4816] 89 90 88 91 NA 90 91 89 NA 92 ...
## $ HcahpsLinearMeanValue_H_COMP_2_LINEAR_SCORE : num [1:4816] 91 92 89 95 NA 90 91 91 NA 92 ...
## $ HcahpsLinearMeanValue_H_COMP_3_LINEAR_SCORE : num [1:4816] 81 75 75 88 NA 85 80 78 NA 85 ...
## $ HcahpsLinearMeanValue_H_COMP_5_LINEAR_SCORE : num [1:4816] 77 76 71 77 NA 76 76 72 NA 78 ...
## $ HcahpsLinearMeanValue_H_COMP_6_LINEAR_SCORE : num [1:4816] 87 86 83 87 NA 86 86 81 NA 86 ...
## $ HcahpsLinearMeanValue_H_COMP_7_LINEAR_SCORE : num [1:4816] 82 79 77 82 NA 81 79 80 NA 83 ...
## $ HcahpsLinearMeanValue_H_CLEAN_LINEAR_SCORE : num [1:4816] 84 80 74 80 NA 81 83 78 NA 88 ...
## $ HcahpsLinearMeanValue_H_QUIET_LINEAR_SCORE : num [1:4816] 86 85 85 87 NA 84 84 82 NA 89 ...
## $ HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE : num [1:4816] 89 85 82 89 NA 88 83 85 NA 90 ...
## $ HcahpsLinearMeanValue_H_RECMND_LINEAR_SCORE : num [1:4816] 90 83 79 88 NA 87 80 84 NA 91 ...
## $ EDV : chr [1:4816] "high" "high" "high" "low" ...
## $ ED_2_Strata_1 : num [1:4816] NA 148 NA NA NA NA NA NA NA NA ...
## $ HCP_COVID_19 : num [1:4816] 80.7 79.8 79 57.9 81.2 88 69.8 87.3 95.9 85.3 ...
## $ IMM_3 : num [1:4816] 95 80 67 53 45 81 65 93 98 81 ...
## $ OP_18b : num [1:4816] 215 147 177 130 118 206 160 185 102 145 ...
## $ OP_18c : num [1:4816] 317 266 NA 216 98 124 220 220 NA 324 ...
## $ OP_22 : num [1:4816] 5 3 1 4 0 5 4 3 0 2 ...
## $ OP_23 : num [1:4816] NA NA 69 NA NA 47 NA 73 NA 35 ...
## $ OP_29 : num [1:4816] 47 96 85 23 67 100 100 NA NA 82 ...
## $ SAFE_USE_OF_OPIOIDS : num [1:4816] 14 19 17 NA NA 20 14 23 NA 17 ...
## $ SEP_1 : num [1:4816] 66 74 56 86 NA 51 92 77 NA 87 ...
## $ SEP_SH_3HR : num [1:4816] 70 88 77 NA NA 78 94 83 NA 90 ...
## $ SEP_SH_6HR : num [1:4816] 100 91 81 NA NA 81 83 100 NA 94 ...
## $ SEV_SEP_3HR : num [1:4816] 79 88 78 89 NA 69 95 85 NA 94 ...
## $ SEV_SEP_6HR : num [1:4816] 95 96 86 97 NA 91 99 97 NA 99 ...
## $ STK_02 : num [1:4816] 98 100 96 NA NA 93 NA 99 NA NA ...
## $ STK_05 : num [1:4816] NA 91 NA NA NA NA NA NA NA NA ...
## $ STK_06 : num [1:4816] NA NA 97 NA NA NA NA NA NA NA ...
## $ VTE_1 : num [1:4816] 98 NA NA NA NA 79 89 84 44 59 ...
## $ VTE_2 : num [1:4816] 99 NA 97 NA NA 88 93 94 NA NA ...
## $ Score_COMP_HIP_KNEE : num [1:4816] 2.7 2.3 4.6 NA NA 3.5 3.8 3.5 NA 4.3 ...
## $ Score_MORT_30_AMI : num [1:4816] 12 13.6 16.5 NA NA 13.2 13.8 13.1 NA NA ...
## $ Score_MORT_30_COPD : num [1:4816] 8.8 9.9 9.9 13.7 NA 10.3 NA 9.2 NA 7.8 ...
## $ Score_MORT_30_HF : num [1:4816] 8.9 14.9 12.5 12.5 NA 13.5 13.6 9.9 NA 16.9 ...
## $ Score_MORT_30_PN : num [1:4816] 18 23.3 19.5 28.5 NA 20.9 22 17.2 NA 26.1 ...
## $ Score_MORT_30_STK : num [1:4816] 14.8 15.3 17.2 NA NA 12.3 NA 13.2 NA 17.3 ...
## $ Score_PSI_03 : num [1:4816] 0.39 0.94 1.39 0.42 0.54 0.13 0.41 0.63 0.57 0.47 ...
## $ Score_PSI_04 : num [1:4816] 185 183 174 NA NA ...
## $ Score_PSI_06 : num [1:4816] 0.23 0.22 0.36 0.24 0.25 0.24 0.24 0.21 0.25 0.22 ...
## $ Score_PSI_08 : num [1:4816] 0.1 0.09 0.08 0.09 0.09 0.08 0.09 0.09 0.09 0.09 ...
## $ Score_PSI_09 : num [1:4816] 2.39 2.69 5.43 2.49 NA 1.88 2.44 3.29 2.44 2.58 ...
## $ Score_PSI_10 : num [1:4816] 1.14 1.37 1.26 1.57 NA 1.72 1.51 1.2 1.57 NA ...
## $ Score_PSI_11 : num [1:4816] 13.83 7.19 7.37 8.45 NA ...
## $ Score_PSI_12 : num [1:4816] 4.49 3.01 3.36 3.89 NA 3.04 3.32 3.67 3.56 5.63 ...
## $ Score_PSI_13 : num [1:4816] 8.05 4.46 4.37 5.19 NA 5.55 4.88 6.08 5.18 NA ...
## $ Score_PSI_14 : num [1:4816] 1.69 1.87 1.76 NA NA 1.86 2.46 2.77 NA 1.83 ...
## $ Score_PSI_15 : num [1:4816] 0.93 0.91 1.34 1.08 NA 1.18 1.04 0.84 NA 0.88 ...
## $ Score_PSI_90 : num [1:4816] 1.21 0.97 1.17 0.95 NA 0.72 0.89 1.17 0.98 1.05 ...
## $ FacilityName : chr [1:4816] "SOUTHEAST HEALTH MEDICAL CENTER" "MARSHALL MEDICAL CENTERS" "NORTH ALABAMA MEDICAL CENTER" "MIZELL MEMORIAL HOSPITAL" ...
## $ State : chr [1:4816] "AL" "AL" "AL" "AL" ...
## $ Payment_PAYM_90_HIP_KNEE : num [1:4816] 22212 18030 21898 NA NA ...
# Create function to find categorical variables
is_categorical <- function(x) is.factor(x) | is.character(x)
# Apply function to all variables in the dataset
categorical_vars <- sapply(HipKneeClean, is_categorical)
# Print the names of all categorical variables
categorical <- names(HipKneeClean)[categorical_vars]
categorical
## [1] "FacilityId" "EDV" "FacilityName" "State"
# Define the encoding mapping (ignore NAs for now)
encoding_map <- c(
'low' = 1,
'medium' = 2,
'high' = 3,
'very high' = 4
)
# Dummy encoding used due to ordinal nature of this data
# Create a copy of HipKneeCleanTest and name it HipKneeTest to separate cleaned dataset and the test dataset
HipKneeTest <- HipKneeCleanTest %>%
mutate(EDV = recode(EDV, !!!encoding_map))
# Print first 20 rows of EDV column in HipKneeClean and HipKneeTrain to ensure proper encoding
cat("HipKneeCleanTest")
## HipKneeCleanTest
print(head(HipKneeCleanTest$EDV, 20))
## [1] "high" "high" "high" "low" "low" "high"
## [7] "low" "medium" "low" "medium" "low" "low"
## [13] "high" "high" "very high" "very high" "low" "high"
## [19] "low" "very high"
cat("HipKneeTest")
## HipKneeTest
print(head(HipKneeTest$EDV, 20))
## [1] 3 3 3 1 1 3 1 2 1 2 1 1 3 3 4 4 1 3 1 4
# Manually map out each state with their respective code in alphabetical order with a preceding 0 to make data non-ordinal
state_mapping <- c(
"AL" = "001",
"AK" = "002",
"AZ" = "003",
"AR" = "004",
"CA" = "005",
"CO" = "006",
"CT" = "007",
"DE" = "008",
"FL" = "009",
"GA" = "010",
"HI" = "011",
"ID" = "012",
"IL" = "013",
"IN" = "014",
"IA" = "015",
"KS" = "016",
"KY" = "017",
"LA" = "018",
"ME" = "019",
"MD" = "020",
"MA" = "021",
"MI" = "022",
"MN" = "023",
"MS" = "024",
"MO" = "025",
"MT" = "026",
"NE" = "027",
"NV" = "028",
"NH" = "029",
"NJ" = "030",
"NM" = "031",
"NY" = "032",
"NC" = "033",
"ND" = "034",
"OH" = "035",
"OK" = "036",
"OR" = "037",
"PA" = "038",
"RI" = "039",
"SC" = "040",
"SD" = "041",
"TN" = "042",
"TX" = "043",
"UT" = "044",
"VT" = "045",
"VA" = "046",
"WA" = "047",
"WV" = "048",
"WI" = "049",
"WY" = "050"
)
# Create new "StateCode" column with the encoded values
HipKneeTest <- HipKneeTest %>%
mutate(StateCode = state_mapping[State])
# Print 100 rows of the "State" and "StateCode" columns to ensure accuracy
print("State and StateCode Columns")
## [1] "State and StateCode Columns"
print(head(HipKneeTest[c("State", "StateCode")], 100))
## # A tibble: 100 × 2
## State StateCode
## <chr> <chr>
## 1 AL 001
## 2 AL 001
## 3 AL 001
## 4 AL 001
## 5 AL 001
## 6 AL 001
## 7 AL 001
## 8 AL 001
## 9 AL 001
## 10 AL 001
## # ℹ 90 more rows
# Print all unique values in "StateCode" column to ensure accuracy
print("Unique StateCode Values")
## [1] "Unique StateCode Values"
print(unique(HipKneeTest$StateCode))
## [1] "001" "002" "003" "004" "005" "006" "007" "008" NA "009" "010" "011"
## [13] "012" "013" "014" "015" "016" "017" "018" "019" "020" "021" "022" "023"
## [25] "024" "025" "026" "027" "028" "029" "030" "031" "032" "033" "034" "035"
## [37] "036" "037" "038" "039" "040" "041" "042" "043" "044" "045" "046" "047"
## [49] "048" "049" "050"
# Compute correlation matrix
cor_matrix <- cor(HipKneeTest %>% select_if(is.numeric), use = "pairwise.complete.obs")
# Melt the correlation matrix into a long format
cor_melted <- melt(cor_matrix)
# Plot heatmap
ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Figure 5. Correlation Heatmap of Numeric Variables")
# Convert correlation matrix to df
cor_table <- as.data.frame(cor_matrix)
# Add variable names as a column
cor_table$Variable <- rownames(cor_table)
# Reorder columns
cor_table <- cor_table %>%
select(Variable, everything())
# Print table
cor_table %>%
kable(caption = "Table 8. Correlation Coefficients Table") %>%
kable_styling(bootstrap_options = c("hover", "striped", "responsive"))
| Variable | PredictedReadmissionRate_HIP-KNEE | HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE | EDV | HCP_COVID_19 | IMM_3 | OP_18b | OP_29 | SAFE_USE_OF_OPIOIDS | VTE_1 | Score_COMP_HIP_KNEE | Score_MORT_30_AMI | Score_MORT_30_COPD | Score_MORT_30_HF | Score_MORT_30_PN | Score_MORT_30_STK | Score_PSI_03 | Score_PSI_04 | Score_PSI_06 | Score_PSI_08 | Score_PSI_09 | Score_PSI_10 | Score_PSI_11 | Score_PSI_12 | Score_PSI_13 | Score_PSI_14 | Score_PSI_15 | Payment_PAYM_90_HIP_KNEE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PredictedReadmissionRate_HIP-KNEE | PredictedReadmissionRate_HIP-KNEE | 1.0000000 | -0.2060912 | 0.1986939 | -0.0563082 | -0.0028840 | 0.1295727 | -0.0106510 | 0.1063002 | 0.0654668 | 0.3208550 | 0.0074065 | -0.0794948 | -0.1067828 | -0.0985660 | -0.0376746 | -0.0037334 | -0.0449077 | 0.0154891 | -0.0214412 | -0.0182303 | 0.0710046 | 0.1130121 | 0.1047402 | 0.1193336 | 0.0140012 | -0.0158282 | 0.2975679 |
| HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE | HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE | -0.2060912 | 1.0000000 | -0.2262341 | 0.0302154 | 0.2182505 | -0.2448842 | 0.0865028 | 0.1100002 | -0.0248375 | -0.1091775 | -0.0952111 | -0.0230632 | 0.0300940 | -0.0702915 | -0.0760295 | -0.0499063 | -0.0948401 | -0.0054742 | -0.0590998 | 0.0707774 | -0.0157615 | -0.1615833 | -0.0674588 | -0.1458284 | -0.0172231 | 0.0304634 | -0.2108956 |
| EDV | EDV | 0.1986939 | -0.2262341 | 1.0000000 | 0.1599806 | 0.0038674 | 0.5918897 | 0.0603877 | -0.1223889 | 0.2992859 | -0.0240093 | -0.0687401 | -0.0840621 | -0.2588739 | -0.1904351 | -0.0754281 | 0.0292819 | -0.0438108 | -0.0308989 | -0.1637017 | -0.0181523 | 0.0837980 | 0.0455190 | 0.0749218 | 0.0724936 | 0.0399935 | 0.0144400 | 0.0053946 |
| HCP_COVID_19 | HCP_COVID_19 | -0.0563082 | 0.0302154 | 0.1599806 | 1.0000000 | 0.3203622 | 0.2574291 | 0.1067941 | -0.0812735 | 0.0241622 | -0.0510683 | -0.0869890 | -0.1128278 | -0.1245435 | -0.1523779 | -0.0988833 | 0.0943953 | 0.0417007 | 0.0225916 | -0.0272232 | 0.0549990 | 0.0009222 | -0.0909811 | 0.1091949 | -0.0160408 | 0.0169512 | 0.0430812 | -0.0627505 |
| IMM_3 | IMM_3 | -0.0028840 | 0.2182505 | 0.0038674 | 0.3203622 | 1.0000000 | 0.1105628 | 0.1317922 | 0.0410289 | 0.0906329 | -0.0212916 | -0.0165321 | -0.0616051 | -0.0010634 | -0.0761104 | 0.0146397 | 0.0451508 | 0.0601831 | 0.0544576 | -0.0311579 | 0.0899361 | 0.0412713 | -0.0625676 | 0.0594933 | -0.0250714 | 0.0723872 | 0.0625753 | -0.0720431 |
| OP_18b | OP_18b | 0.1295727 | -0.2448842 | 0.5918897 | 0.2574291 | 0.1105628 | 1.0000000 | 0.0506067 | -0.1400845 | 0.2344307 | -0.0293698 | -0.0678837 | -0.1527290 | -0.2195933 | -0.1858291 | -0.0905644 | 0.0583644 | 0.0638412 | 0.0187794 | -0.0806516 | 0.0224833 | 0.0544528 | -0.0076797 | 0.1653812 | 0.0993570 | 0.0676972 | 0.0374530 | -0.0241965 |
| OP_29 | OP_29 | -0.0106510 | 0.0865028 | 0.0603877 | 0.1067941 | 0.1317922 | 0.0506067 | 1.0000000 | -0.0650231 | 0.1567526 | -0.0096464 | -0.0600569 | 0.0081252 | 0.0099705 | -0.0536184 | -0.0251654 | -0.0032584 | 0.0312780 | 0.0059688 | -0.0199006 | 0.0150699 | 0.0331569 | -0.0837910 | -0.0108546 | -0.0087678 | 0.0163779 | 0.0419068 | -0.0815209 |
| SAFE_USE_OF_OPIOIDS | SAFE_USE_OF_OPIOIDS | 0.1063002 | 0.1100002 | -0.1223889 | -0.0812735 | 0.0410289 | -0.1400845 | -0.0650231 | 1.0000000 | -0.0563373 | -0.0081923 | -0.0643353 | -0.0362573 | 0.0171605 | -0.0204107 | -0.0804707 | -0.0287158 | -0.0995591 | -0.0603629 | -0.0017558 | -0.0043396 | -0.0382369 | 0.0137283 | -0.0380145 | -0.0258097 | -0.0166146 | -0.0268805 | -0.0048449 |
| VTE_1 | VTE_1 | 0.0654668 | -0.0248375 | 0.2992859 | 0.0241622 | 0.0906329 | 0.2344307 | 0.1567526 | -0.0563373 | 1.0000000 | -0.0526925 | -0.0493931 | -0.0282911 | -0.1051171 | -0.1710235 | -0.1238916 | -0.0378500 | -0.1180160 | -0.0262166 | -0.0522876 | -0.0577213 | -0.0037803 | -0.0301775 | -0.0317534 | -0.0440578 | 0.0086141 | 0.0326658 | -0.1256363 |
| Score_COMP_HIP_KNEE | Score_COMP_HIP_KNEE | 0.3208550 | -0.1091775 | -0.0240093 | -0.0510683 | -0.0212916 | -0.0293698 | -0.0096464 | -0.0081923 | -0.0526925 | 1.0000000 | 0.0830479 | -0.0203930 | -0.0007242 | 0.0241066 | 0.0211621 | 0.0498557 | 0.0038509 | 0.0505415 | 0.0577776 | 0.0540124 | 0.0813038 | 0.1279724 | 0.1458258 | 0.1334619 | 0.0498603 | 0.0433809 | 0.3410864 |
| Score_MORT_30_AMI | Score_MORT_30_AMI | 0.0074065 | -0.0952111 | -0.0687401 | -0.0869890 | -0.0165321 | -0.0678837 | -0.0600569 | -0.0643353 | -0.0493931 | 0.0830479 | 1.0000000 | 0.2498600 | 0.3407616 | 0.3309425 | 0.2222539 | 0.0415523 | 0.2105379 | 0.0885083 | 0.1010348 | 0.0889343 | 0.1066619 | 0.1037006 | 0.0492328 | 0.0467554 | 0.0454462 | 0.0297688 | 0.0591548 |
| Score_MORT_30_COPD | Score_MORT_30_COPD | -0.0794948 | -0.0230632 | -0.0840621 | -0.1128278 | -0.0616051 | -0.1527290 | 0.0081252 | -0.0362573 | -0.0282911 | -0.0203930 | 0.2498600 | 1.0000000 | 0.3844105 | 0.3710744 | 0.2038243 | -0.0069743 | 0.1713379 | 0.0478268 | 0.0397571 | 0.0429090 | 0.0320669 | 0.0426574 | -0.0532586 | 0.0026944 | 0.0734846 | 0.0340007 | -0.0406696 |
| Score_MORT_30_HF | Score_MORT_30_HF | -0.1067828 | 0.0300940 | -0.2588739 | -0.1245435 | -0.0010634 | -0.2195933 | 0.0099705 | 0.0171605 | -0.1051171 | -0.0007242 | 0.3407616 | 0.3844105 | 1.0000000 | 0.4479367 | 0.3147371 | 0.0371596 | 0.2556384 | 0.0679149 | 0.1051698 | 0.0707269 | 0.0383771 | 0.0362529 | -0.0300702 | -0.0086832 | 0.0647245 | 0.0342374 | -0.0350247 |
| Score_MORT_30_PN | Score_MORT_30_PN | -0.0985660 | -0.0702915 | -0.1904351 | -0.1523779 | -0.0761104 | -0.1858291 | -0.0536184 | -0.0204107 | -0.1710235 | 0.0241066 | 0.3309425 | 0.3710744 | 0.4479367 | 1.0000000 | 0.3042563 | 0.0303815 | 0.2301195 | 0.0543554 | 0.0884315 | 0.0217880 | 0.0237048 | 0.0704445 | 0.0089560 | 0.0393676 | 0.0464407 | 0.0029691 | -0.0062985 |
| Score_MORT_30_STK | Score_MORT_30_STK | -0.0376746 | -0.0760295 | -0.0754281 | -0.0988833 | 0.0146397 | -0.0905644 | -0.0251654 | -0.0804707 | -0.1238916 | 0.0211621 | 0.2222539 | 0.2038243 | 0.3147371 | 0.3042563 | 1.0000000 | 0.0687216 | 0.2380935 | 0.0878847 | 0.1014879 | 0.0674377 | 0.0622532 | 0.0725381 | 0.0474896 | 0.0513975 | 0.0492194 | 0.0625191 | -0.0272101 |
| Score_PSI_03 | Score_PSI_03 | -0.0037334 | -0.0499063 | 0.0292819 | 0.0943953 | 0.0451508 | 0.0583644 | -0.0032584 | -0.0287158 | -0.0378500 | 0.0498557 | 0.0415523 | -0.0069743 | 0.0371596 | 0.0303815 | 0.0687216 | 1.0000000 | 0.1353085 | 0.0601750 | 0.0636661 | 0.1407342 | 0.0386211 | 0.0114365 | 0.1186788 | 0.0298580 | 0.0596798 | 0.0999683 | 0.0086745 |
| Score_PSI_04 | Score_PSI_04 | -0.0449077 | -0.0948401 | -0.0438108 | 0.0417007 | 0.0601831 | 0.0638412 | 0.0312780 | -0.0995591 | -0.1180160 | 0.0038509 | 0.2105379 | 0.1713379 | 0.2556384 | 0.2301195 | 0.2380935 | 0.1353085 | 1.0000000 | 0.0601419 | 0.0870693 | 0.1059485 | 0.0523892 | 0.0649032 | 0.0782559 | 0.0123489 | 0.0652098 | 0.1018205 | -0.0766302 |
| Score_PSI_06 | Score_PSI_06 | 0.0154891 | -0.0054742 | -0.0308989 | 0.0225916 | 0.0544576 | 0.0187794 | 0.0059688 | -0.0603629 | -0.0262166 | 0.0505415 | 0.0885083 | 0.0478268 | 0.0679149 | 0.0543554 | 0.0878847 | 0.0601750 | 0.0601419 | 1.0000000 | 0.0724291 | 0.1014588 | 0.0516246 | 0.0351464 | 0.1431056 | 0.0509831 | 0.0527115 | 0.0910520 | 0.0456525 |
| Score_PSI_08 | Score_PSI_08 | -0.0214412 | -0.0590998 | -0.1637017 | -0.0272232 | -0.0311579 | -0.0806516 | -0.0199006 | -0.0017558 | -0.0522876 | 0.0577776 | 0.1010348 | 0.0397571 | 0.1051698 | 0.0884315 | 0.1014879 | 0.0636661 | 0.0870693 | 0.0724291 | 1.0000000 | 0.0052449 | -0.0360093 | 0.0198090 | 0.0394605 | 0.0093444 | 0.0228045 | 0.0127268 | -0.0041983 |
| Score_PSI_09 | Score_PSI_09 | -0.0182303 | 0.0707774 | -0.0181523 | 0.0549990 | 0.0899361 | 0.0224833 | 0.0150699 | -0.0043396 | -0.0577213 | 0.0540124 | 0.0889343 | 0.0429090 | 0.0707269 | 0.0217880 | 0.0674377 | 0.1407342 | 0.1059485 | 0.1014588 | 0.0052449 | 1.0000000 | 0.0885278 | 0.0680540 | 0.1732337 | 0.0519119 | 0.1207438 | 0.2197254 | -0.0237660 |
| Score_PSI_10 | Score_PSI_10 | 0.0710046 | -0.0157615 | 0.0837980 | 0.0009222 | 0.0412713 | 0.0544528 | 0.0331569 | -0.0382369 | -0.0037803 | 0.0813038 | 0.1066619 | 0.0320669 | 0.0383771 | 0.0237048 | 0.0622532 | 0.0386211 | 0.0523892 | 0.0516246 | -0.0360093 | 0.0885278 | 1.0000000 | 0.1626632 | 0.1079488 | 0.2303938 | 0.0453739 | 0.0830134 | 0.0497447 |
| Score_PSI_11 | Score_PSI_11 | 0.1130121 | -0.1615833 | 0.0455190 | -0.0909811 | -0.0625676 | -0.0076797 | -0.0837910 | 0.0137283 | -0.0301775 | 0.1279724 | 0.1037006 | 0.0426574 | 0.0362529 | 0.0704445 | 0.0725381 | 0.0114365 | 0.0649032 | 0.0351464 | 0.0198090 | 0.0680540 | 0.1626632 | 1.0000000 | 0.1172504 | 0.2506376 | -0.0093577 | 0.0464067 | 0.1441986 |
| Score_PSI_12 | Score_PSI_12 | 0.1047402 | -0.0674588 | 0.0749218 | 0.1091949 | 0.0594933 | 0.1653812 | -0.0108546 | -0.0380145 | -0.0317534 | 0.1458258 | 0.0492328 | -0.0532586 | -0.0300702 | 0.0089560 | 0.0474896 | 0.1186788 | 0.0782559 | 0.1431056 | 0.0394605 | 0.1732337 | 0.1079488 | 0.1172504 | 1.0000000 | 0.1742084 | 0.0522204 | 0.1358951 | 0.0655557 |
| Score_PSI_13 | Score_PSI_13 | 0.1193336 | -0.1458284 | 0.0724936 | -0.0160408 | -0.0250714 | 0.0993570 | -0.0087678 | -0.0258097 | -0.0440578 | 0.1334619 | 0.0467554 | 0.0026944 | -0.0086832 | 0.0393676 | 0.0513975 | 0.0298580 | 0.0123489 | 0.0509831 | 0.0093444 | 0.0519119 | 0.2303938 | 0.2506376 | 0.1742084 | 1.0000000 | 0.0056987 | 0.0878105 | 0.0949467 |
| Score_PSI_14 | Score_PSI_14 | 0.0140012 | -0.0172231 | 0.0399935 | 0.0169512 | 0.0723872 | 0.0676972 | 0.0163779 | -0.0166146 | 0.0086141 | 0.0498603 | 0.0454462 | 0.0734846 | 0.0647245 | 0.0464407 | 0.0492194 | 0.0596798 | 0.0652098 | 0.0527115 | 0.0228045 | 0.1207438 | 0.0453739 | -0.0093577 | 0.0522204 | 0.0056987 | 1.0000000 | 0.1176726 | -0.0181150 |
| Score_PSI_15 | Score_PSI_15 | -0.0158282 | 0.0304634 | 0.0144400 | 0.0430812 | 0.0625753 | 0.0374530 | 0.0419068 | -0.0268805 | 0.0326658 | 0.0433809 | 0.0297688 | 0.0340007 | 0.0342374 | 0.0029691 | 0.0625191 | 0.0999683 | 0.1018205 | 0.0910520 | 0.0127268 | 0.2197254 | 0.0830134 | 0.0464067 | 0.1358951 | 0.0878105 | 0.1176726 | 1.0000000 | -0.0467071 |
| Payment_PAYM_90_HIP_KNEE | Payment_PAYM_90_HIP_KNEE | 0.2975679 | -0.2108956 | 0.0053946 | -0.0627505 | -0.0720431 | -0.0241965 | -0.0815209 | -0.0048449 | -0.1256363 | 0.3410864 | 0.0591548 | -0.0406696 | -0.0350247 | -0.0062985 | -0.0272101 | 0.0086745 | -0.0766302 | 0.0456525 | -0.0041983 | -0.0237660 | 0.0497447 | 0.1441986 | 0.0655557 | 0.0949467 | -0.0181150 | -0.0467071 | 1.0000000 |
# Change - to _ in HIP-KNEE
colnames(HipKneeTest) <- gsub("-", "_", colnames(HipKneeTest))
# Remove all NA values in target variable "PredictedReadmissionRate_HIP_KNEE"
HipKneeTest <- HipKneeTest %>% filter(!is.na(PredictedReadmissionRate_HIP_KNEE))
# Remove all NA values in the "State", "StateCode", and "FacilityName" columns
HipKneeTest <- HipKneeTest %>% drop_na(State, StateCode, FacilityName)
# Print number of remaining variables and observations
dimensions <- dim(HipKneeTest)
cat("Number of variables:", dimensions[2], "\n")
## Number of variables: 31
cat("Number of observations:", dimensions[1], "\n")
## Number of observations: 1833
# Calculate missing values
missing_values_summary <- HipKneeTest %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
mutate(Missing_Percentage = (Missing_Count / nrow(HipKneeTest)) * 100)
# Print table
missing_values_summary %>%
kable(caption = "Table 7. Missing Values Summary") %>%
kable_styling(bootstrap_options = c("hover", "striped", "responsive"))
| Variable | Missing_Count | Missing_Percentage |
|---|---|---|
| FacilityId | 0 | 0.0000000 |
| PredictedReadmissionRate_HIP_KNEE | 0 | 0.0000000 |
| HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE | 33 | 1.8003273 |
| EDV | 90 | 4.9099836 |
| HCP_COVID_19 | 16 | 0.8728860 |
| IMM_3 | 16 | 0.8728860 |
| OP_18b | 75 | 4.0916530 |
| OP_29 | 222 | 12.1112930 |
| SAFE_USE_OF_OPIOIDS | 69 | 3.7643208 |
| VTE_1 | 994 | 54.2280415 |
| Score_COMP_HIP_KNEE | 40 | 2.1822149 |
| Score_MORT_30_AMI | 405 | 22.0949264 |
| Score_MORT_30_COPD | 247 | 13.4751773 |
| Score_MORT_30_HF | 141 | 7.6923077 |
| Score_MORT_30_PN | 125 | 6.8194217 |
| Score_MORT_30_STK | 284 | 15.4937261 |
| Score_PSI_03 | 8 | 0.4364430 |
| Score_PSI_04 | 575 | 31.3693399 |
| Score_PSI_06 | 2 | 0.1091107 |
| Score_PSI_08 | 2 | 0.1091107 |
| Score_PSI_09 | 2 | 0.1091107 |
| Score_PSI_10 | 41 | 2.2367703 |
| Score_PSI_11 | 40 | 2.1822149 |
| Score_PSI_12 | 2 | 0.1091107 |
| Score_PSI_13 | 42 | 2.2913257 |
| Score_PSI_14 | 87 | 4.7463175 |
| Score_PSI_15 | 29 | 1.5821058 |
| FacilityName | 0 | 0.0000000 |
| State | 0 | 0.0000000 |
| Payment_PAYM_90_HIP_KNEE | 42 | 2.2913257 |
| StateCode | 0 | 0.0000000 |
# Calculate median for columns with <5% missing values
numeric_vars_low_missing <- c("HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE", "EDV", "HCP_COVID_19", "IMM_3", "OP_18b", "SAFE_USE_OF_OPIOIDS", "Score_COMP_HIP_KNEE", "Score_PSI_03", "Score_PSI_06", "Score_PSI_08", "Score_PSI_09", "Score_PSI_10", "Score_PSI_11", "Score_PSI_12", "Score_PSI_13", "Score_PSI_14", "Score_PSI_15", "Payment_PAYM_90_HIP_KNEE")
for (var in numeric_vars_low_missing) {
HipKneeTest[[var]][is.na(HipKneeTest[[var]])] <- median(HipKneeTest[[var]], na.rm = TRUE)
}
# Select high missingness variables for KNN imputation
vars_for_knn <- c("VTE_1", "Score_MORT_30_AMI", "Score_MORT_30_COPD", "Score_MORT_30_HF", "Score_MORT_30_PN", "Score_MORT_30_STK", "Score_PSI_04", "OP_29")
# Perform KNN imputation
HipKneeTest_knn <- kNN(HipKneeTest, variable = vars_for_knn, k = 5)
# Remove columns created by the KNN function
HipKneeTest_knn <- HipKneeTest_knn %>% select(-ends_with("_imp"))
# Update HipKneeTrain with imputed values
HipKneeTest[vars_for_knn] <- HipKneeTest_knn[vars_for_knn]
# Calculate missing values
missing_values_summary <- HipKneeTest %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
mutate(Missing_Percentage = (Missing_Count / nrow(HipKneeTest)) * 100)
# Print table
missing_values_summary %>%
kable(caption = "Table 7. Missing Values Summary") %>%
kable_styling(bootstrap_options = c("hover", "striped", "responsive"))
| Variable | Missing_Count | Missing_Percentage |
|---|---|---|
| FacilityId | 0 | 0 |
| PredictedReadmissionRate_HIP_KNEE | 0 | 0 |
| HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE | 0 | 0 |
| EDV | 0 | 0 |
| HCP_COVID_19 | 0 | 0 |
| IMM_3 | 0 | 0 |
| OP_18b | 0 | 0 |
| OP_29 | 0 | 0 |
| SAFE_USE_OF_OPIOIDS | 0 | 0 |
| VTE_1 | 0 | 0 |
| Score_COMP_HIP_KNEE | 0 | 0 |
| Score_MORT_30_AMI | 0 | 0 |
| Score_MORT_30_COPD | 0 | 0 |
| Score_MORT_30_HF | 0 | 0 |
| Score_MORT_30_PN | 0 | 0 |
| Score_MORT_30_STK | 0 | 0 |
| Score_PSI_03 | 0 | 0 |
| Score_PSI_04 | 0 | 0 |
| Score_PSI_06 | 0 | 0 |
| Score_PSI_08 | 0 | 0 |
| Score_PSI_09 | 0 | 0 |
| Score_PSI_10 | 0 | 0 |
| Score_PSI_11 | 0 | 0 |
| Score_PSI_12 | 0 | 0 |
| Score_PSI_13 | 0 | 0 |
| Score_PSI_14 | 0 | 0 |
| Score_PSI_15 | 0 | 0 |
| FacilityName | 0 | 0 |
| State | 0 | 0 |
| Payment_PAYM_90_HIP_KNEE | 0 | 0 |
| StateCode | 0 | 0 |
# Average death rates amongst mortality variables and create new column "Score_Ovr_MORT"
HipKneeTest$Score_Ovr_MORT <- rowMeans(HipKneeTest[, c("Score_MORT_30_AMI",
"Score_MORT_30_COPD",
"Score_MORT_30_HF",
"Score_MORT_30_PN",
"Score_MORT_30_STK")],
na.rm = TRUE)
# Remove old mortality columns
HipKneeTest <- HipKneeTest[, !(names(HipKneeTest) %in% c("Score_MORT_30_AMI",
"Score_MORT_30_COPD",
"Score_MORT_30_HF",
"Score_MORT_30_PN",
"Score_MORT_30_STK"))]
# Compute correlation matrix
cor_matrix <- cor(HipKneeTest %>% select_if(is.numeric), use = "pairwise.complete.obs")
# Melt the correlation matrix into a long format
cor_melted <- melt(cor_matrix)
# Plot heatmap
ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Figure 5. Correlation Heatmap of Numeric Variables")
save(HipKneeTest, file = "HipKneeTest.RData")
# Create a summary table of descriptive statistics
descr_stats <- describe(HipKneeTrain)
# Remove the rows with Facility ID, State and State code, and facility name
descr_stats <- descr_stats %>% filter(vars != c(1, 23, 24, 26))
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `vars != c(1, 23, 24, 26)`.
## Caused by warning in `vars != c(1, 23, 24, 26)`:
## ! longer object length is not a multiple of shorter object length
# Remove columns 1, 2, 5, and 6
descr_stats <- descr_stats[, -c(1, 2, 5, 6)]
# Create a table with kable
kable(descr_stats, format = "html", caption = "Descriptive Statistics for All Numeric Variables in Final Dataset") %>%
kable_styling(
bootstrap_options = c("hover", "striped", "responsive")
) %>%
column_spec(1, bold = TRUE) %>%
column_spec(2, width = "5em") %>%
row_spec(0, bold = TRUE, background = "#f2f2f2")
| mean | sd | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|
| PredictedReadmissionRate_HIP_KNEE | 4.546561e+00 | 0.9093914 | 0.8576841 | 1.9279 | 8.569 | 6.6411 | 0.4373061 | 0.4908058 | 0.0212407 |
| HcahpsLinearMeanValue_H_HSP_RATING_LINEAR_SCORE | 8.672450e+01 | 3.7230744 | 2.9652000 | 66.0000 | 98.000 | 32.0000 | -0.5851721 | 1.5050011 | 0.0869602 |
| EDV | 2.694490e+00 | 1.0250451 | 1.4826000 | 1.0000 | 4.000 | 3.0000 | -0.1385316 | -1.1578643 | 0.0239421 |
| HCP_COVID_19 | 8.887098e+01 | 9.5606361 | 8.3025600 | 32.6000 | 100.000 | 67.4000 | -1.3148000 | 2.1219028 | 0.2233087 |
| IMM_3 | 7.932570e+01 | 17.7478481 | 16.3086000 | 4.0000 | 100.000 | 96.0000 | -1.0958217 | 0.7241943 | 0.4145381 |
| OP_18b | 1.895739e+02 | 49.6783664 | 44.4780000 | 62.0000 | 587.000 | 525.0000 | 0.9817063 | 3.3080181 | 1.1603422 |
| OP_29 | 9.215876e+01 | 11.7440127 | 4.4478000 | 0.0000 | 100.000 | 100.0000 | -3.3057142 | 14.4641144 | 0.2743060 |
| SAFE_USE_OF_OPIOIDS | 1.557229e+01 | 4.2502679 | 2.9652000 | 0.0000 | 45.000 | 45.0000 | 0.5890847 | 3.1474800 | 0.0992739 |
| VTE_1 | 9.107365e+01 | 9.1561257 | 4.4478000 | 5.0000 | 100.000 | 95.0000 | -4.3773341 | 29.2741519 | 0.2138605 |
| Score_COMP_HIP_KNEE | 3.179924e+00 | 0.5539759 | 0.4447800 | 1.6000 | 6.200 | 4.6000 | 0.7280978 | 1.7214783 | 0.0129393 |
| Score_PSI_03 | 5.878123e-01 | 0.5183737 | 0.2816940 | 0.0500 | 6.310 | 6.2600 | 3.5027779 | 21.8351982 | 0.0121077 |
| Score_PSI_04 | 1.688142e+02 | 19.0239898 | 15.6117780 | 86.6800 | 241.810 | 155.1300 | -0.0656641 | 1.2542628 | 0.4443451 |
| Score_PSI_06 | 2.467703e-01 | 0.0452557 | 0.0296520 | 0.1200 | 0.510 | 0.3900 | 0.9085914 | 1.9551905 | 0.0010570 |
| Score_PSI_08 | 8.994540e-02 | 0.0082004 | 0.0000000 | 0.0600 | 0.130 | 0.0700 | 0.5502254 | 1.7580966 | 0.0001915 |
| Score_PSI_09 | 2.514146e+00 | 0.4953103 | 0.3409980 | 1.1000 | 6.100 | 5.0000 | 1.2648156 | 4.8211377 | 0.0115690 |
| Score_PSI_10 | 1.572951e+00 | 0.3731798 | 0.1482600 | 0.4700 | 4.550 | 4.0800 | 1.8287437 | 7.4056616 | 0.0087164 |
| Score_PSI_11 | 8.969864e+00 | 3.1123203 | 2.2239000 | 2.7300 | 49.000 | 46.2700 | 2.7479837 | 22.3393252 | 0.0726947 |
| Score_PSI_12 | 3.582755e+00 | 0.7944605 | 0.6671700 | 1.6100 | 7.510 | 5.9000 | 0.9512312 | 1.6446910 | 0.0185563 |
| Score_PSI_13 | 5.287103e+00 | 1.0312666 | 0.7413000 | 2.1700 | 10.790 | 8.6200 | 1.0438233 | 2.7498825 | 0.0240874 |
| Score_PSI_14 | 2.010540e+00 | 0.3569335 | 0.1927380 | 1.0700 | 4.400 | 3.3300 | 1.9129489 | 6.4857308 | 0.0083369 |
| Score_PSI_15 | 1.102668e+00 | 0.3271882 | 0.2223900 | 0.3500 | 3.430 | 3.0800 | 1.7127063 | 5.4101230 | 0.0076422 |
| FacilityName* | 9.026263e+02 | 517.8533090 | 667.1700000 | 1.0000 | 1796.000 | 1795.0000 | -0.0093756 | -1.2029068 | 12.0955473 |
| State* | 2.469449e+01 | 14.4448000 | 20.7564000 | 1.0000 | 50.000 | 49.0000 | 0.0222112 | -1.3657016 | 0.3373885 |
| Payment_PAYM_90_HIP_KNEE | 2.105666e+04 | 1943.7230953 | 1716.8508000 | 15936.0000 | 34916.000 | 18980.0000 | 0.7645293 | 1.9173982 | 45.3997188 |
| StateCode* | 2.472995e+01 | 14.5368326 | 20.7564000 | 1.0000 | 50.000 | 49.0000 | 0.0160810 | -1.3497989 | 0.3395381 |
| Score_Ovr_MORT | 1.305795e+01 | 1.1615024 | 1.0378200 | 8.1600 | 17.420 | 9.2600 | -0.0667113 | 0.5511229 | 0.0271293 |
# Select numeric columns
numeric_columns <- HipKneeTrain %>% select_if(is.numeric)
# Melt the data for easier plotting with ggplot2
numeric_melted <- melt(numeric_columns)
## No id variables; using all as measure variables
# Create histograms
ggplot(numeric_melted, aes(x = value)) +
geom_histogram(bins = 30, fill = "blue", color = "black") +
facet_wrap(~variable, scales = "free_x") +
theme_minimal() +
labs(title = "Histograms of Numeric Variables", x = "Value", y = "Frequency")